Private identity agreement for private set functionalities

Abstract

Private set intersection and related functionalities are among the most prominent real-world applications of secure multiparty computation. While such protocols have attracted significant attention from the research community, other functionalities are often required to support a PSI application in practice. For example, in order for two parties to run a PSI over the unique users contained in their databases, they might first invoke a support functionality to agree on the primary keys to represent their users.

This paper studies a secure approach to agreeing on primary keys. We introduce and realize a functionality that computes a common set of identifiers based on incomplete information held by two parties, which we refer to as private identity agreement, and we prove the security of our protocol in the honest-but-curious model. We explain the subtleties in designing such a functionality that arise from privacy requirements when intending to compose securely with PSI protocols. We also argue that the cost of invoking this functionality can be amortized over a large number of PSI sessions, and that for applications that require many repeated PSI executions, this represents an improvement over a PSI protocol that directly uses incomplete or fuzzy matches.

Keywords

Private set intersection, private identity agreement, garbled circuits

1. Introduction

In recent years Private Set Intersection (PSI) and related two-party protocols have been deployed in real-world applications [21]. In the simplest setting of PSI, each party has a set $X_{i}$ as its input, and the output will be the intersection $⋂ X_{i}$ . More generally the parties may wish to compute some function f over the intersection and obtain output $f (⋂ X_{i})$ [8,13,21,27,28].

Owing to its importance in real-world applications, PSI has been the topic of a significant body of research. Common PSI paradigms include DDH-style protocols [1,10,19,22,31], approaches based on oblivious transfer [12,26,29,30] or oblivious polynomial evaluation [9,14], and approaches based on garbled circuits [18,26,27]. Performance improvements have been dramatic, especially in the computational overhead of PSI.

State-of-the-art PSI protocols require exact matches to compute the intersection; in other words, the intersection is based on bitwise equality. In real-world application scenarios the parties may not have inputs that match exactly. As an example, consider the case of two centralized electronic medical record (EMR) providers, which supply and aggregate medical records for medical practitioners, who wish to conduct a study about the number of patients who develop a particular disease after their recent medical histories indicate at-risk status. The EMR providers could use a PSI protocol to count the total number of unique diagnoses among their collective patients. Unfortunately, the EMR providers may not have the same set of information about each patient in their databases; for example, one might identify Alice by her street address and phone number, while the other might use her phone number and email address. Further complicating matters, Bob could use “bob@email.com” for one provider, but “BobDoe123@university.edu” for another.

It may appear that naively applying PSI to each column in two parties’ databases would allow them to realize their desired functionality, but such an approach has many flaws. For example, in the case that individuals use different identifying information for the different services, this approach could incur false negatives. To remedy this issue, there has been previous research on the private record linkage problem, in which “fuzzy matches” between records are permitted [16,33]. In this problem, two rows from different parties’ databases can be said to match if they satisfy some closeness relation, for example by matching approximately on t out of n columns. However, fuzzy matching PSI protocols are not as performant as exact-matching protocols.

As a design goal, we consider applications in which two parties would like to run PSI many times over respective databases. In our EMR example, the rows comprising users change slowly as new patients enter the system and some are expunged. However, auxiliary medical data could change frequently, at least daily. If the EMR providers wish to continuously update their medical models or run multiple analyses, they may run many PSI instances with auxiliary data [21].

In general, for many applications it is desirable for two parties to run PSI-style protocols many times over their respective data sets, and in this work we assume the parties will perform many joint queries. It is therefore advantageous for the parties to first establish a new column for their databases, containing a single key for each row that can be used for the most performant exact-match PSI protocols.

As a second design goal, we relax an assumption that is standard for the private record linkage problem. We believe that it is not always realistic in practice to assume or to ensure that each participant’s database uniquely maps its rows to identities. For example, one EMR provider may unknowingly have multiple records about the same person in its database, as a result of that person providing different identifying information to different medical providers. As part of a correct protocol, some preprocessing phase must identify records that belong to the same individual – using both parties’ records – and group them accordingly. This is especially important for PSI applications that compute aggregate statistics.

This correctness requirement introduces an additional privacy concern. Consider the case in which party A has a single row in its database that matches more than one row in party B’s database. Naively running a protocol to produce primary keys which link records would inevitably reveal some information to one of the parties. Either party A would learn that party B is unaware that its rows could be merged, or party B would learn that it has several rows that correspond to a single person. Either way, one party will learn more about the other party’s input than it should.

This work focuses on resolving the apparent trade-offs in privacy and performance between state-of-the-art exact-matching and fuzzy-matching PSI protocols. Our approach is to design a new two-party protocol that computes a new identifier for every row in both databases that will give exact matches. To avoid the additional leakage problem described above, our protocol outputs either (a) shares of the new identifiers, or (b) encryptions of the new identifiers for a generic CPA-secure encryption scheme with XOR homomorphism, which can be decrypted with a key held by the other party. (Our protocol can also output both shares and encryptions, and we in fact prove security in the case that it outputs both.) The regime of PSI protocols that can be composed with our protocol is limited to those that can combine shares or decrypt online without revealing the plaintext to either party. However, supporting both output formats leaves flexibility in the design of PSI protocols that can be composed with ours. Additionally, although our identifier-agreement protocol is computationally intensive compared to the subsequent PSI protocol, we argue that this is a one-time cost that can be amortized over many PSI computations.

1.1. Our contributions

This work addresses two problems: (1) The performance and accuracy tradeoffs between exact matching PSI and fuzzy matching PSI protocols. (2) The correctness and privacy problems introduced to PSI by the possibility of poorly defined rows. We address both of these problems in one shot by defining a functionality that computes shared primary keys for two parties’ databases, such that the keys can be used multiple times as inputs to successive efficient PSI protocols, without revealing the keys to the parties. We refer to our stated problem as the private identity agreement functionality, and define it formally. We additionally discuss the security implications of composing our identity agreement functionality and subsequent PSI functionalities. We note that identity agreement is substantially more complex than private set intersection and private record linkage because of the concerns introduced by producing an intermediate output of a larger functionality.

After defining the identity agreement problem, we present a novel two-party protocol that solves the problem in the honest-but-curious model. We additionally describe a modification to our protocol that allows the outputs to naturally compose with DDH-style PSI protocols. Finally we present performance of a prototype implementation.

1.2. Technical overview

1.2.1. Agreement with record linkage

Imagine there exists a universe $U$ of identifying information for a set of individuals, and that there exist two parties $p_{1}$ and $p_{2}$, where each party $p_{i}$ has a database $D_{p_{i}} \subset U$. Consider the problem of constructing a mapping $Λ : U \to \{0, 1\}^{ℓ}$ such that for every two pieces of information $u_{1}$ and $u_{2}$ belonging to the same individual, $Λ (u_{1}) = Λ (u_{2})$, and such that party $p_{i}$ knows $Λ (u)$ for every piece of information $u \in D_{p_{i}}$, while $p_{i}$ knows nothing about Λ on $U ∖ D_{p_{i}}$.

The naive approach is simply to perform a PSI protocol to discover all the elements in $D_{p_{1}} \cap D_{p_{2}}$; for each element $u \in D_{p_{1}} \cap D_{p_{2}}$, the parties randomly generate some label that serves as $Λ (u)$. Additionally, it is possible to randomly generate a label for each element $v \in (D_{p_{1}} \cup D_{p_{2}}) ∖ (D_{p_{1}} \cap D_{p_{2}})$, such that each party learns a random label for every element in its database which is not in the other party’s database. Intuitively, random labels are generated so that a party $p_{i}$ cannot tell whether $u \in D_{p_{1}} \cap D_{p_{2}}$ from the label itself. Indeed, this approach is very similar to that of Buddhavarapu et al. [6].

However, this does not suffice for our problem, precisely due to the issue of linking records. It may be the case that some party has two pieces of information in its database which belong to the same identity, but that it does not know that they belong to the same identity; however, there may be information in the other party’s database that can be used to link the two pieces of information. For example, Alice’s database may have two separate rows (john@email.com, 888-867-5309) and (john@university.edu, 123 First Street), while Bob’s database has a single row (john@email.com, 123 First Street) that links the two rows from Alice’s database (under the assumption that every piece of identifying information belongs to only a single identity). Therefore, although PSI suffices for identifying the elements in both databases, and it is possible to generate random labels for items not in the intersection, we must additionally perform some computation that groups pieces of information that belong to the same identity, without revealing to either party whether this has happened.

1.2.2. Approach

We model the ID agreement problem as a graph problem, as follows. For the universe $U$, there exists a corresponding graph $G_{U}$ in which every piece of information is a vertex, and there exists an edge between every two pieces of information that belong to the same individual. For each party’s database there exists a corresponding graph $G_{p_{i}} \subset G_{U}$ such that for every piece of identifying information in $p_{i}$’s database there is a vertex in $G_{p_{i}}$, and between any two vertices which $p_{i}$ believes belong to the same individual, there is an edge in $G_{p_{i}}$.

We explain in Section 4.2 that assigning a unique label to each component in $G_{p_{1}} \cup G_{p_{2}}$ suffices for computing the map Λ. We perform an iterative procedure on the two parties’ graphs as follows. First, both parties assign labels to all of the vertices in their graphs, where each label is computed based on the piece of information it represents in the party’s respective database; this labeling step requires that if $u \in G_{p_{1}}$ and $u \in G_{p_{2}}$ , then u is assigned the same label in $G_{p_{1}}$ and $G_{p_{2}}$ . If a vertex u is in only $G_{p_{1}}$ or $G_{p_{2}}$ but not both, then u’s label is with high probability unique amongst all labels that are assigned.

Next, we use PSI to identify the vertices in the intersection of the two parties’ graphs. (Correspondingly, we perform PSI to identify the elements in the intersection of the two parties’ databases, and assign common labels to those elements.) We then perform computations on the graph structure in order to “fix” the vertices so that every vertex in the same component of $G_{p_{1}} \cup G_{p_{2}}$ is assigned the same label. The crucial aspect of this computation is that a party $p_{i}$ may assign the same label to two vertices in its own graph $G_{p_{i}}$ which the party does not know are connected in $G_{p_{1}} \cup G_{p_{2}}$. This must be done obliviously to $p_{i}$ in order to prevent leaking that the other party has “connecting information” which links two of $p_{i}$’s vertices. (Correspondingly, we perform oblivious computations to group elements in a party’s database that belong to the same identity, even though the party does not know they belong to the same identity.)

Our challenge is to perform the “fixing” step obliviously. Our approach requires a black-box operator that allows us to perform conditional updates on the labels that we assign to each vertex. For this, we turn to garbled circuits. At a high level, we design a garbled circuit that iteratively performs a two-step process. First, the circuit performs PSI (borrowing a garbled-circuit PSI subroutine from [18]) to assign matching labels to all vertices in the intersection of the two parties’ graphs. The garbled circuit then performs conditional updates on the labels of every vertex in the parties’ graphs to ensure that vertices in the same component of $G_{p_{1}} \cup G_{p_{2}}$ are assigned the same label. We show that given enough iterations of our procedure, we correctly implement the “fixing” step without revealing private information of either party.

1.3. Related work

Private set intersection with “fuzzy” matches has been considered in previous research. An early work by Freedman, Nissim, and Pinkas on PSI included a proposed fuzzy matching protocol based on oblivious polynomial evaluation [14]. Unfortunately that protocol had a subtle flaw identified by Chmielewski and Hoepman, who proposed solutions based on OPE, secret sharing, and private Hamming distance [7].

Concurrently with our work, Buddhavarapu et al. [6] presented private matching for compute, which addresses the problem of producing intermediate identifiers for repeated private set functionalities. They use different techniques and parallelize their implementation, which yields very good performance. However, they do not address the issues incurred by linking records between data sets, which is the most technically challenging and expensive aspect of the problem we consider.

Wen and Dong presented a protocol solving the private record linkage problem, which is similar to the common identifiers problem in this work [33]. In that setting the goal is to determine which records in several databases correspond to the same person, and to then reveal those records. Wen and Dong present two approaches, one for exact matches using the garbled bloom filter technique from previous work on PSI [12] and one for fuzzy matches that uses locality-sensitive hash functions [20] to build the bloom filter. One important difference between the PRL setting and ours is that our privacy goal requires the matches and non-matches to remain secret from both parties. We also assign a label to each record, with the property that when two records match they are assigned the same label.

Huang, Evans, and Katz compared the performance of custom PSI protocols to approaches based on garbled circuits [18]. One of their constructions, which they call sort-compare-shuffle, is a repeated subroutine in our construction. Unlike their constructions, our output is not a set intersection.

2. Problem definition

Our setting assumes two parties, each holding some database, who wish to engage in inner-join style queries on their two databases, which we refer to as the private joint-database query functionality $F^{Query}$ . The join will be over some subset of columns, and will be a disjunction i.e. two rows are matched if any of the columns in the join match. In Fig. 1 we present the ideal private joint-database query functionality.

Fig. 1.

Query functionality $F^{Query}$ , which receives two parties’ databases and responds to queries over functions of the databases.

Fig. 2.

ID agreement functionality $F^{ID}$ .

We consider a scenario in which it is advantageous for the parties to first establish a new database column containing keys for each record. We refer to this as the private identity agreement functionality, denoted $F^{ID}$ and described in Fig. 2. As we have explained, establishing these keys is a setup phase in a general protocol that realizes $F^{Query}$ . After establishing keys, they can be used for many subsequent exact-match PSI protocols.

Importantly, the newly established keys should not be revealed to either party, as this could also reveal information about the other party’s input. This makes it impossible to separate the protocol for $F^{ID}$ from the subsequent PSI-style protocols that the parties will use for their joint queries. We must therefore modify any PSI-style protocols as well to ensure a secure composition with $F^{ID}$. Specifically, $F^{ID}$ is required to produce a sharing or encryption of the computed keys, and subsequent PSI-style protocols used to evaluate queries must be modified to reconstruct shares or decrypt the encryptions online.

2.1. The identity agreement functionality

We denote the set of possible identifiers that either party may hold by $I = ⨂ I_{i}$, with each set $I_{i}$ being one column and having $⊥ \in I_{i}$. To define a “match” we define an equivalence relation $S_{1} \overset{user}{\sim} S_{2}$ as follows: if there exists a component $s_{1, j} \neq ⊥$ of $S_{1}$ and a component $s_{2, k} \neq ⊥$ of $S_{2}$ such that $s_{1, j} = s_{2, k}$, then $S_{1} \overset{user}{\sim} S_{2}$. In other words, we consider two rows to be equivalent if any of their non-empty columns are equal.²

² It is possible to establish $\overset{user}{\sim}$ for $S_{1}$ and $S_{2}$ for any binary relation that $s_{1, j}$ and $s_{2, k}$ may satisfy; however, we feel that equality is the most natural, and consider only equality in this work. We remark later to indicate when one could substitute another relation for equality in the construction.

For each party $p_{i}$ with database $D_{p_{i}}$ , we assume for simplicity of exposition that every pair of rows $S_{1}$ and $S_{2}$ satisfies $S_{1} \overset{user}{≁} S_{2}$ . (This means that the party does not have sufficient information to conclude that the two rows represent the same element.) Note, however, that it is possible for $p_{1}$ to have rows $S_{1, 1}$ , $S_{1, 2}$ , and for $p_{2}$ to have a row $S_{2}$ such that $S_{1, 1} \overset{user}{\sim} S_{2} \overset{user}{\sim} S_{1, 2}$ . In such a situation, $p_{1}$ is not aware that its database contains two rows that represent the same element.
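As an illustration only, the matching rule on rows can be sketched in a few lines of Python. The names `Row` and `user_match` are hypothetical, `None` stands in for the empty entry ⊥, and the sketch follows the formal definition above by comparing any non-empty component of one row with any non-empty component of the other. The sample rows are the ones from the example in Section 1.2.1.

```python
from typing import Optional, Sequence

# A row is a sequence of identifiers; None plays the role of the empty entry.
Row = Sequence[Optional[str]]


def user_match(s1: Row, s2: Row) -> bool:
    """Return True if some non-empty component of s1 equals some
    non-empty component of s2 (plain equality, as in the text)."""
    values1 = {x for x in s1 if x is not None}
    values2 = {x for x in s2 if x is not None}
    return not values1.isdisjoint(values2)


# Rows taken from the example in Section 1.2.1.
alice_row_1 = ("john@email.com", "888-867-5309")
alice_row_2 = ("john@university.edu", "123 First Street")
bob_row = ("john@email.com", "123 First Street")

# Alice's own rows do not match each other directly,
# but Bob's row matches both of them.
print(user_match(alice_row_1, alice_row_2))
print(user_match(alice_row_1, bob_row), user_match(bob_row, alice_row_2))
```

This is exactly the situation described above: neither of Alice's rows matches the other, yet both are linked through a row that only Bob holds.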

The goal of the identity agreement functionality $F^{ID}$ is to compute a map $Λ : D_{p_{1}} \cup D_{p_{2}} \to U$ such that for any $S_{1} \overset{user}{\sim} S_{2}$ , $Λ (S_{1}) = Λ (S_{2})$ , and for all $S_{1} \overset{user}{≁} S_{2}$ , $Λ (S_{1}) \neq Λ (S_{2})$ . As we explain below, the parties will not learn $Λ (S_{i})$ for their respective databases; they will only see encryptions of the map.

We define the privacy goals of $F^{ID}$ in relation to the overall query functionality $F^{Query}$: to compute some PSI functionality where the intersection is determined by the $\overset{user}{\sim}$ relation. Importantly, if $F^{ID}$ is composed with other protocols to realize $F^{Query}$, then $F^{ID}$ may not reveal any information about $\overset{user}{\sim}$ to either party. Consider, for example, a situation where $p_{1}$ has in its input $S_{1}$ and $S_{2}$, and in $p_{2}$’s input there is an $S_{*}$ such that $S_{1} \overset{user}{\sim} S_{*} \overset{user}{\sim} S_{2}$. It should not be the case that $p_{1}$ will learn $S_{1} \overset{user}{\sim} S_{2}$, beyond what can be inferred from the output of the PSI functionality. Likewise, $p_{2}$ should not learn that $p_{1}$ has such elements in its input. If $F^{ID}$ revealed such information, then some party could learn more from the composition of $F^{ID}$ with another functionality than it would learn from querying only $F^{Query}$.

2.2. The graph labeling functionality

We present the two-party component labeling functionality $F^{lbl}$ in Fig. 3. In the two-party graph labeling problem, each party $p_{i} \in \{p_{1}, p_{2}\}$ holds a graph $G_{i}$ which is a subgraph of some larger graph $G_{U}$. The two parties must assign labels to all of the vertices in their respective graphs, subject to the constraint that every two vertices $v_{1}, v_{2} \in G_{1} \cup G_{2}$ are assigned the same label if and only if they are in the same connected component of $G_{1} \cup G_{2}$. In Sections 4.1 and 4.2, we present the two-party component labeling problem in detail and show that ID agreement is equivalent to two-party component labeling.

The functionality $F^{lbl}$ is very similar to $F^{ID}$, except that $F^{lbl}$ produces two secret outputs for each of a party’s vertex labels. For each party $p_{i} \in \{p_{1}, p_{2}\}$ and for every vertex $v \in G_{i}$, it sends to $p_{i}$ (a) a CPA-secure encryption of $p_{i}$’s $Λ (v)$, encrypted under a key held by $p_{3 - i}$, and (b) a one-time pad of $Λ (v)$, where the mask is known by $p_{3 - i}$.

$F^{lbl}$ also receives a mapping function Λ from the adversary. This allows us to prove security without requiring the functionality to sample component labels.
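To make the one-time-pad output concrete, the following minimal sketch shows how a label could be split into a masked value (given to $p_{i}$) and a mask (known to $p_{3 - i}$). The function names `otp_share` and `otp_reconstruct` are hypothetical, and the sketch runs in the clear; in the functionality itself no single party ever sees the label.

```python
import secrets
from typing import Tuple


def otp_share(label: bytes) -> Tuple[bytes, bytes]:
    """Split a label into (masked, mask) with a fresh uniformly random mask.
    Either value alone is uniformly distributed and reveals nothing."""
    mask = secrets.token_bytes(len(label))
    masked = bytes(a ^ b for a, b in zip(label, mask))
    return masked, mask


def otp_reconstruct(masked: bytes, mask: bytes) -> bytes:
    """XOR the two shares back together to recover the label."""
    return bytes(a ^ b for a, b in zip(masked, mask))


# Illustrative label value (an assumption; real labels come from the protocol).
masked, mask = otp_share(b"label-42")
print(otp_reconstruct(masked, mask))
```

A subsequent PSI-style protocol would have to perform the reconstruction step inside its own secure computation, which is why the text requires such protocols to combine shares online.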

In Sections 4 and 5, respectively, we present a protocol to realize $F^{lbl}$ and a proof of its security in the honest-but-curious model. A slight modification of the protocol, simply removing one set of one-time pad inputs and the masked outputs, is sufficient to realize $F^{ID}$ (for an appropriate labeling function $F^{lbl}$ ). The proof is analogous to that given in Section 5.

Fig. 3.

Two-party component labeling functionality $F^{lbl}$ .

3. Security primitives and cryptographic assumptions

3.1. Garbled circuits

Garbled circuits were proposed by Andrew Yao [34] as a means to a generic two-party computation protocol. Yao’s protocol consists of two subprotocols: a garbling scheme [5] and an oblivious transfer. Most of the CPU work of a garbled circuit protocol involves symmetric primitives, and as Bellare et al. show, garbling schemes can use a block cipher with a fixed key, further improving performance [5]. A drawback of garbled circuits is that they require as much communication as computation, but this can be mitigated by using garbled circuits to implement efficient subprotocols of a larger protocol.

Our construction makes use of garbled circuits to realize secure function evaluation. The functionality $F^{GC}$ is presented in Fig. 4 (refer to [24] for the proof of security). The functionality takes the description of a circuit and two parties’ inputs, and returns each party’s respective output.

Fig. 4.

2-party secure function evaluation $F^{GC}$ .

3.2. CPA-secure encryption

In this section we give the description of a CPA-secure encryption scheme with XOR-homomorphism. In our construction we use a modification of the ElGamal encryption scheme [15]. We describe the additional homomorphism provided by ElGamal below.

Definition 1 (Xor-Homomorphic CPA-Secure Encryption Scheme).

A CPA-secure encryption scheme with XOR-homomorphism is a tuple of probabilistic polynomial-time algorithms $(CPA . Gen, CPA . Enc, CPA . Dec, CPA . Xor)$, where:

$CPA . Gen$: Given a security parameter λ, $CPA . Gen (λ)$ returns a public-private key pair $(pk, sk)$, and specifies a message space $M$.

$CPA . Enc$: Given the public key $pk$ and a plaintext message $m \in M$, one can compute a ciphertext $c \leftarrow CPA . Enc (pk, m)$, a CPA-secure encryption of m under $pk$.

$CPA . Dec$: Given the secret key $sk$ and a ciphertext c, $CPA . Dec (c, sk)$ recovers the plaintext m.

$CPA . Xor$: Given a ciphertext c which encrypts message m and an additional plaintext message $m^{'}$, $CPA . Xor (c, m^{'})$ produces a ciphertext $c^{'}$ which encrypts $m \oplus m^{'}$.

ElGamal encryption. Our construction uses the ElGamal encryption scheme as an implementation of a CPA-secure scheme in order to compose gracefully with a specific class of protocols. When specifically using ElGamal rather than a generic CPA-secure scheme, we use the notation $ElGl$ in place of $CPA$.

The ElGamal encryption scheme is CPA-secure under the DDH assumption, and supports homomorphic group operations. If the plaintext space is small, addition in the exponent can also be supported, but decryption in this case requires computing a discrete logarithm. Using the identity element of the group, the homomorphism can be used to re-randomize a ciphertext.

We include the description of the homomorphism as an algorithm of the ElGamal encryption scheme:

$ElGl . Mul$: Given the public key $pk$ and a set of ciphertexts $\{ElGl . Enc (pk, m_{i})\}_{i}$ encrypting messages $\{m_{i}\}_{i}$, one can homomorphically compute a ciphertext encrypting the product of the underlying messages: $ElGl . Enc (pk, \prod_{i} m_{i}) = ElGl . Mul (\{ElGl . Enc (pk, m_{i})\}_{i})$.
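The multiplicative homomorphism and the re-randomization trick can be checked on a toy instance. The sketch below is an assumption-laden illustration, not the implementation used in this work: the group parameters `P`, `Q`, `G` are tiny and completely insecure, and the function names `gen`, `enc`, `dec`, `mul` are hypothetical stand-ins for $ElGl . Gen$, $ElGl . Enc$, $ElGl . Dec$, and $ElGl . Mul$.

```python
import secrets

# Toy parameters (assumption, NOT secure): P = 2*Q + 1 is a safe prime and
# G = 4 generates the subgroup of prime order Q in Z_P^*.
P, Q, G = 467, 233, 4


def gen():
    """Return a key pair (pk, sk) with sk in [1, Q-1]."""
    sk = secrets.randbelow(Q - 1) + 1
    return pow(G, sk, P), sk


def enc(pk, m):
    """Encrypt a group element m as (g^r, m * pk^r)."""
    r = secrets.randbelow(Q - 1) + 1
    return pow(G, r, P), (m * pow(pk, r, P)) % P


def dec(c, sk):
    """Recover m = c2 / c1^sk; the inverse is computed via Fermat's little theorem."""
    c1, c2 = c
    shared = pow(c1, sk, P)
    return (c2 * pow(shared, P - 2, P)) % P


def mul(ciphertexts):
    """Component-wise product: encrypts the product of the plaintexts."""
    c1, c2 = 1, 1
    for a, b in ciphertexts:
        c1, c2 = (c1 * a) % P, (c2 * b) % P
    return c1, c2


pk, sk = gen()
product = mul([enc(pk, 9), enc(pk, 16)])   # plaintexts are squares mod P
rerandomized = mul([enc(pk, 25), enc(pk, 1)])  # multiply by Enc(identity)
print(dec(product, sk), dec(rerandomized, sk))
```

Multiplying by a fresh encryption of the identity element changes the ciphertext but not the plaintext, which is the re-randomization property mentioned above.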

In Section 4.4.2, we describe how to use $ElGl . Mul$ to induce an XOR homomorphism by encrypting bit-by-bit. This allows us to avoid performing group operations in the garbled circuit subprotocol and reduces communication cost, at the cost of CPU effort.

4. Secure ID agreement as secure two-party component labeling

In this section, we define a graph problem that we call Two-Party Component Labeling, and provide a reduction between ID agreement and Two-Party Component Labeling. We then describe an algorithm to compute component labeling and a two-party protocol that securely implements it.

4.1. Two-party component labeling

In the two-party component labeling problem, each party $p_{i} \in \{p_{1}, p_{2}\}$ has a graph $G_{p_{i}} = (V_{p_{i}}, E_{p_{i}})$, where there exists some universe of vertices V for which $V_{p_{1}} \subset V$ and $V_{p_{2}} \subset V$. Each party’s graph contains at most N vertices, which are distributed among connected components of size at most m. Both N and m are parameters of the problem. As shorthand, we refer to a connected component as a component. As output, each party assigns a label to every component in its graph. If there are two components $C_{1} \in G_{p_{1}}$ and $C_{2} \in G_{p_{2}}$ for which $C_{1}$ and $C_{2}$ have a non-empty intersection, then $p_{1}$ and $p_{2}$ must assign the same label to $C_{1}$ and $C_{2}$. Just as we explained with ID agreement, this property induces a transitive relation. If two vertices $v \in G_{p_{1}}$ and $u \in G_{p_{2}}$ are in the same component of $G = G_{p_{1}} \cup G_{p_{2}}$, then their components in $G_{p_{1}}$ and $G_{p_{2}}$, respectively, must be assigned the same label.

More precisely, consider parties $p_{1}$ and $p_{2}$ with graphs $G_{p_{1}}$ and $G_{p_{2}}$, respectively, and let $C_{p_{i}}$ represent the set of components that constitute $G_{p_{i}}$. Moreover, assume the vertices in $G_{p_{1}}$ and $G_{p_{2}}$ are drawn from some universe of vertices V. The two-party component labeling problem is to construct a map $Λ : C_{p_{1}} \cup C_{p_{2}} \to U$, where $U$ is a universe of labels. For any two components $C_{1}, C_{n} \in C_{p_{1}} \cup C_{p_{2}}$, $Λ (C_{1}) = Λ (C_{n})$ if and only if there is some series of components $C_{2}, \dots, C_{n - 1} \in C_{p_{1}} \cup C_{p_{2}}$ such that $C_{i} \cap C_{i + 1} \neq \emptyset$ for $i \in \{1, \dots, n - 1\}$.

4.2. Reducing ID agreement to two-party component labeling

We reduce the two-party identity agreement problem to two-party component labeling. Each party $p_{i}$ represents its database as a graph $G_{p_{i}} = (V_{p_{i}}, E_{p_{i}})$ as follows. Each piece of identifying information in party $p_{i}$’s database is represented by a vertex in $p_{i}$’s graph. (Empty entries in a database are simply left out of the graph.) Edges in the graph connect vertices which represent identifying information of the same user. Therefore, each record in a party’s database is represented as a component in the party’s graph.

The component labeling of two graphs $G_{p_{1}}$ and $G_{p_{2}}$ can be trivially used to assign user identities. The identifier of a user represented by a component C in $G_{p_{i}}$ is directly copied from the label assigned to C during component labeling.

Intuitively, the reduction works because the two parties compute ID agreement over their databases by computing a union of their graphs. If the two parties’ graphs contain the same vertex v (meaning both databases contain the same piece of identifying information), then the components containing v in $G_{p_{1}}$ and $G_{p_{2}}$ are the same component in the union graph $G = G_{p_{1}} \cup G_{p_{2}}$ .
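The reduction can be sketched in the clear as follows. This is a non-private reference computation under stated assumptions: rows are tuples with `None` for empty entries, the helper names `database_to_graph` and `union_components` are hypothetical, and a standard union-find is used to compute the components of the union graph (the private protocol does not compute them this way). The databases are the ones from the example in Section 1.2.1.

```python
from itertools import combinations


def database_to_graph(rows):
    """One vertex per non-empty entry; an edge between every two entries
    of the same row, so each row becomes one component."""
    vertices, edges = set(), set()
    for row in rows:
        entries = sorted({x for x in row if x is not None})
        vertices.update(entries)
        edges.update(combinations(entries, 2))
    return vertices, edges


def union_components(graph1, graph2):
    """Connected components of the union of two graphs (union-find)."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for vertices, _ in (graph1, graph2):
        for v in vertices:
            parent.setdefault(v, v)
    for _, edges in (graph1, graph2):
        for u, v in edges:
            parent[find(u)] = find(v)

    components = {}
    for v in parent:
        components.setdefault(find(v), set()).add(v)
    return list(components.values())


alice = [("john@email.com", "888-867-5309"),
         ("john@university.edu", "123 First Street")]
bob = [("john@email.com", "123 First Street")]

components = union_components(database_to_graph(alice), database_to_graph(bob))
print(components)
```

Alice's two rows are separate components in her own graph, but the shared vertices with Bob's row merge everything into a single component of the union graph, which is why both rows must receive the same identifier.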

4.3. An algorithm for two-party component labeling

In this section, we present a component labeling algorithm without explicitly addressing privacy concerns. In Section 4.4, we present a protocol to implement the algorithm while preserving privacy. We illustrate an execution of our percolate-and-match algorithm in Fig. 5 and provide pseudo-code in Fig. 6.

Fig. 5.

Depiction of the percolate-and-match algorithm for component labeling. $v_{1}$, $v_{2}$, and $v_{3}$ are common to both $G_{p_{1}}$ and $G_{p_{2}}$. Each vertex’s label is drawn inside the vertex, and its identity is on the side. $p_{1}$’s graph has an edge between $v_{2}$ and $v_{3}$, while $p_{2}$’s graph has an edge between $v_{1}$ and $v_{2}$. Solid lines depict edges in each party’s graph. Dotted lines depict matches that occur during matching phases. The figures show the evolution of the algorithm over 2 iterations.

Our component labeling algorithm is an iterative procedure in which the two parties assign labels to every vertex in their respective graphs and then progressively update their vertices’ labels. To initialize the procedure, each party $p_{i}$ constructs an initial labeling for its local graph by assigning a unique label to every vertex in its graph $G_{p_{i}}$. Specifically, every vertex v in $G_{p_{i}}$ is assigned the label v, the encoding of the vertex itself.³

³ In an application, the label would be the data that the vertex represents. Additionally, if the parties agree on an encoding scheme beforehand, types (address, zip, phone) can be encoded as part of a label at the cost of only a few bits.

Notice that in the initial labeling, no two vertices within a party’s graph are assigned the same label. However, any vertex that is included in both graphs is assigned the same label by both parties.

By the end of the iterative procedure, two properties of the labelings must be met. First, within each component of a party’s graph, all vertices must have the same label. Second, if any vertex is in both parties’ graphs, then the vertex has the same label in both parties’ labelings. Together, these two requirements enforce that every two vertices within a component $C \subset G$ have the same label. This common label can then be taken as the component’s label.

Each step in our iterative procedure is a two-phase process. The first phase operates on each party’s graph independently. It enforces the property that for each component in a party’s graph, every vertex in the component has the same label. In this phase, the algorithm assigns to every vertex $v \in G_{p_{i}}$ the (lexicographic) minimum of all labels in its component in $G_{p_{i}}$ . We call this a percolation phase because common labels are percolated to every vertex in a component.

In the second phase, the algorithm operates on the vertices which are common to both parties’ graphs. It ensures that every vertex v which is common to both parties’ graphs has been assigned the same label in the two parties’ labelings. If one party’s label for v differs from the other party’s label for v, then both labels are updated to the minimum of the two labels that have been assigned to v. We call this a matching phase because vertices which are common to both graphs are assigned matching labels in the two labelings.

If some vertex’s label is updated in a matching phase, then its label may differ from the labels of the other vertices in its component. Therefore, the iterative procedure repeats until labelings stabilize. During percolation, each vertex’s label is set to the minimum label of all vertices in its component. If some vertex’s label changes during a matching phase, its new label must be “smaller” than its previous label. During the next percolation phase, the change is propagated by again updating the label of each vertex in the component to the minimum label in the component. In Section 4.5, we prove that if m denotes the maximum size of a component in $G_{p_{1}} \cup G_{p_{2}}$ , then at most $m - 1$ iterations are necessary for vertex labels to stabilize.
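As an illustration only, the following Python sketch runs the percolate-and-match procedure in the clear, with no privacy protection. The function name `percolate_and_match` and the representation of a graph as a list of components (each a set of vertex identifiers) are assumptions made for this sketch and are not taken from Fig. 6.

```python
def percolate_and_match(components_1, components_2):
    """Run percolate-and-match in the clear on two parties' graphs.

    Each argument is a list of components; a component is a set of vertex ids.
    Returns (labels_1, labels_2, iterations), where iterations counts the
    rounds in which at least one label changed.
    """
    # Initial labeling: every vertex is labeled with its own encoding.
    labels_1 = {v: v for comp in components_1 for v in comp}
    labels_2 = {v: v for comp in components_2 for v in comp}
    common = labels_1.keys() & labels_2.keys()

    iterations = 0
    while True:
        changed = False
        # Percolation phase: each party works on its own graph.
        for comps, labels in ((components_1, labels_1), (components_2, labels_2)):
            for comp in comps:
                smallest = min(labels[v] for v in comp)
                for v in comp:
                    if labels[v] != smallest:
                        labels[v] = smallest
                        changed = True
        # Matching phase: shared vertices take the smaller of the two labels.
        for v in common:
            smallest = min(labels_1[v], labels_2[v])
            if labels_1[v] != smallest or labels_2[v] != smallest:
                labels_1[v] = labels_2[v] = smallest
                changed = True
        if not changed:
            return labels_1, labels_2, iterations
        iterations += 1
```

On the three-vertex example of Fig. 5 (an edge between $v_{2}$ and $v_{3}$ for $p_{1}$ , and an edge between $v_{1}$ and $v_{2}$ for $p_{2}$ ), this sketch stabilizes after two label-changing iterations, with every vertex labeled $v_{1}$ .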

Fig. 6.

Pseudocode for component labeling algorithm.

4.4. Private component labeling

We now provide a protocol which implements the component labeling algorithm described in Section 4.3 while preserving privacy. The ideal functionality $F^{lbl}$ for private two-party component labeling is given in Fig. 3.

4.4.1. Approach

The challenge of securely implementing the percolate-and-match algorithm arises from the fact that percolate-and-match performs two operations: (1) comparisons on vertices, and (2) updates on vertex labels. However, if either party knew the output of any such operation on its vertices, then it would learn information about the other party’s graph. Consider that if a participant learns that its vertex’s label changed during any matching phase or during a percolation phase, it learns that one of the vertices in its graph has a matching vertex in the other party’s graph. Similarly, if a party learns that its vertex’s label isn’t updated during the first matching or percolation phase, it learns its vertex isn’t in the other party’s graph.

Our approach is to perform both vertex comparisons and label updates without revealing the output of any comparison or update, and to encrypt all intermediate and output labels so that no information is leaked about the computation. Naively adapting state-of-the-art PSI protocols in order to perform our matching phase does not work for this approach, because in addition to finding the common vertices in the two parties’ graphs, we must also perform updates on matching labels; state-of-the-art PSI protocols do not provide easy ways to modify auxiliary data without revealing information.

To implement comparisons and updates, we use garbled circuits. Importantly, garbled circuits must implement oblivious algorithms, whose operation is independent of the input data. Notably, for any branch in the execution tree of an oblivious algorithm, we must perform operations for both possible paths, replacing the untaken path with dummy operations. Additionally, random accesses to an array (i.e. those for which the index being read is input-dependent) must either scan over the entire array, incurring an $O (N)$ cost, or use Oblivious RAM techniques, which incur $\log (N)$ communication overhead per access [2,23].

Matching via garbled circuits. To perform our matching phase obliviously, we adapt a technique described by Huang, Evans, and Katz for PSI [18]. In their scheme, called Sort-Compare-Shuffle, each party sorts its elements, then provides its sorted list to a garbled circuit which merges the two lists. If two parties submit the same element, then the two copies of the element land in adjacent indices in the merged list. The circuit iterates through the list, comparing elements at adjacent indices in order to identify common elements. After comparing, their circuit shuffles the sorted list before revealing elements in the intersection.

Our construction adapts the sort-compare-shuffle technique to efficiently perform our matching phase with label updates as follows. Each party submits a sorted list of vertices to a garbled circuit, including auxiliary information for each vertex that represents the vertex’s currently assigned label. To perform matching, we merge the two parties’ lists of vertices into one sorted list $\vec{L}$ and iterate through $\vec{L}$ . At each pair of adjacent indices in $\vec{L}$ , we conditionally assign both elements’ current labels to the minimum of the two labels only if the vertices match. Matching in this way via garbled circuit hides from both parties all of the matches that are made between the two parties’ graphs and their respective label updates.
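The matching pass can be sketched in plain Python as follows. This is a non-private illustration: the name `match_adjacent` and the list layout `[img, lbl, party]` are assumptions for the sketch, and a real garbled circuit would replace the library merge with a fixed, data-independent merging network.

```python
import heapq

def match_adjacent(sorted_1, sorted_2):
    """Merge two sorted descriptor lists and equalize labels of matches.

    Each descriptor is a list [img, lbl, party]; both inputs are sorted by img.
    Descriptors are copied, so the callers' lists are left untouched.
    """
    merged = [list(d) for d in heapq.merge(sorted_1, sorted_2, key=lambda d: d[0])]
    for a, b in zip(merged, merged[1:]):
        same = a[0] == b[0]
        smallest = min(a[1], b[1])
        # Conditional assignment: always computed, only applied on a match,
        # mirroring how an oblivious circuit evaluates both outcomes.
        a[1] = smallest if same else a[1]
        b[1] = smallest if same else b[1]
    return merged
```

Because each party's own list contains no repeated vertex, a shared vertex appears exactly twice in the merged list, at adjacent positions, so a single pass over adjacent pairs is enough.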

Percolation via garbled circuits. To percolate labels within a component, a party can submit all of the vertices in one of its components to a garbled circuit along with each vertex’s current label. The circuit computes the minimum of the labels and assigns the minimum label to each vertex in the component.

Stitching percolation and matching together. The remaining question is how to efficiently stitch together percolation phases and matching phases without revealing intermediate labels of any vertices. We perform percolation and matching in the same circuit. To transition between percolation and matching phases, we permute the list of vertices. We define a permutation π which is hidden from both parties, and apply π and $π^{- 1}$ to $\vec{L}$ to transition from matching phase to percolation phase and back.

Our garbled circuit begins by merging the two parties’ sorted lists into one large list $\vec{L}$ . We then apply π to the list to shuffle the list, hiding all information about the sorted order of $\vec{L}$ . Next, we reveal indices of each party’s components in $π (\vec{L})$ . For the graph $G_{p_{i}}$ of each party i, and for each component $C \subset G_{p_{i}}$ , both parties learn the indices of C’s vertices in $π (\vec{L})$ . We use these indices to hard-wire the min-circuit that percolates labels within C. After percolating, we apply the inverse permutation $π^{- 1}$ to $π (\vec{L})$ and can again iterate through $\vec{L}$ to merge. The circuit repeatedly applies π and $π^{- 1}$ to $\vec{L}$ to transition from matching phase to percolation phase and back. After $m - 1$ iterations, the circuit outputs encrypted labels to the two parties.

We remark that permuting $\vec{L}$ and revealing indices of each party’s components avoids in-circuit random access to look up current vertex labels for each percolation circuit, and hence the overhead of ORAM. Revealing indices allows our circuit to hard-wire the indices of the min-circuits that perform percolation, achieving $O (1)$ cost per label lookup at the expense of an $O (n log n)$ shuffle between phases. We can consider that this technique allows us to amortize the expense of two permutations with cost $n log n$ over the n memory accesses we do for each iteration, which yields the same asymptotic complexity as ORAM [2,23].

Graph structure. There is one additional caveat. Revealing the indices of the vertices of each component in $\vec{L}$ reveals the structure of the parties’ graphs. To prevent this, we require that both parties pad their graphs to some predetermined structure. To simplify the presentation, we set a number of components C and a maximum size of each component m, and then have each party pad its graph using randomly selected vertices until each graph contains $N = C m$ vertices.

Outputs. Each party receives as output XOR-shares of both parties’ labels and their own labels encrypted under a key held by the other party.

Fig. 7.

Illustration of the percolate-and-match garbled circuit approach.

4.4.2. Protocol in depth

We now describe how to privately implement our percolate-and-match algorithm. Figure 7 illustrates our approach, Fig. 8 contains the full protocol, and Fig. 9 describes our garbled circuits.

Fig. 8.

Full protocol for secure component labeling.

Fig. 9.

Garbled circuit for secure component labeling. The subcircuits ${GC}^{RevealOrder}$ and ${GC}^{Perc&Match}$ are defined by the procedures $RevealOrder$ and $PercAndMatch$ . In ${GC}^{Perc&Match}$ , variables ${out}_{p_{i}}$ and $I_{p_{i}}$ are public and must be hard-coded.

Protocol inputs. As input, each party $p_{i}$ has a graph $G_{p_{i}} = (V_{p_{i}}, E_{p_{i}})$ . If some party has fewer than m vertices in some component(s) of its graph, it pads its graph by randomly sampling vertices to add to its component(s). The two parties must agree on a number of components C and the maximum size of a component m, and each must pad its graph until it has $N = C m$ vertices.

In addition, each party $p_{i}$ has a key $k_{p_{i}}^{enc}$ for a CPA-Secure encryption scheme with XOR homomorphism.

Initial labelings. Each party represents the current labeling of its graph as a list of vertex descriptors. A descriptor $L (v)$ of a vertex v is a triple $({img}_{v}, {lbl}_{v}, {party}_{v})$ , where ${img}_{v}$ is the image of vertex v under a shared collision-resistant hash function $H : \{0, 1\}^{*} \to \{0, 1\}^{ℓ}$ , ${lbl}_{v}$ is the label assigned to v, and ${party}_{v}$ is the party’s identifier (1 or 2). We note that we use H to hash each vertex descriptor to a uniform length.

We refer to $p_{i}$ ’s labeling as ${\vec{L}}_{p_{i}} = \{L (v)\}_{v \in G_{p_{i}}}$ . In the initial labeling that each party constructs, each vertex’s label is initially set to its image under H (meaning ${img}_{v} = {lbl}_{v}$ ). After constructing its labeling ${\vec{L}}_{p_{i}}$ , each party sorts its labeling ${\vec{L}}_{p_{i}}$ on the images of its vertices under H.

Garbled circuit part 1: Merge, permute, and reveal order. After the parties set up their inputs, they invoke a garbled circuit that merges their descriptors into a combined labeling $\vec{L}$ and reveals the indices of their descriptors under a random permutation. First, each party $p_{i}$ submits its sorted labeling ${\vec{L}}_{p_{i}}$ to a garbled circuit. The garbled circuit merges ${\vec{L}}_{p_{1}}$ and ${\vec{L}}_{p_{2}}$ into one list $\vec{L}$ using a Batcher Merge [3]. Second, the circuit uses Waksman networks [4,32] to shuffle $\vec{L}$ under a permutation π that is unknown to either party. Each party $p_{i}$ randomly samples a permutation $π_{p_{i}}$ on $2 N$ elements and inputs its selection bits $s_{p_{i}}$ which specify $π_{p_{i}}$ to the circuit. We define $π = π_{p_{1}} \circ π_{p_{2}}$ ; the circuit permutes $\vec{L}$ as $π (\vec{L})$ .

Next, the circuit reveals to each party the indices of its own vertices in $π (\vec{L})$ . First, each party samples $2 N$ one-time pads ${\vec{σ}}_{p_{i}}$ and submits them to the circuit. The garbled circuit then iterates through $π (\vec{L})$ , and at each index, the descriptor’s third element (which is the party identity) determines what to output to each party. If the vertex at index j was submitted by $p_{i}$ , then $p_{i}$ receives from the garbled circuit $π (\vec{L}) [j] . img$ , the image of the vertex at index j in the permuted list, and $p_{3 - i}$ receives $π (\vec{L}) [j] . img \oplus σ_{p_{i}, j}$ , which is the same image but masked by $p_{i}$ ’s jth one-time pad.
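The output rule of this reveal step can be sketched in the clear as below. The helper names `reveal_order` and `own_indices` are hypothetical, images are modeled as Python integers, and the sketch assumes that a party recognizes its own positions by spotting its own unmasked images (a masked image coinciding with one of them is taken to be negligibly likely for ℓ-bit images).

```python
def reveal_order(permuted, pads_1, pads_2):
    """Plaintext stand-in for the index-revealing output step.

    permuted: list of (img, lbl, party) descriptors, already shuffled.
    pads_i:   one integer pad per index, chosen by party i.
    Returns (out_1, out_2), the per-index values each party would see.
    """
    pads = {1: pads_1, 2: pads_2}
    out = {1: [], 2: []}
    for j, (img, _lbl, party) in enumerate(permuted):
        other = 3 - party
        out[party].append(img)                   # owner sees the image itself
        out[other].append(img ^ pads[party][j])  # other side sees it masked
    return out[1], out[2]

def own_indices(my_images, my_view):
    """Indices at which the received value is one of my own vertex images."""
    return [j for j, value in enumerate(my_view) if value in my_images]
```

Each party can then group the indices returned by `own_indices` by component, which gives the index sets that are later hard-wired into the percolation subcircuits.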

Given the indices of each of its vertices in $π (\vec{L})$ , each party computes the indices composing each of its components. Let $idx (v)$ denote the index of vertex v in $π (\vec{L})$ . For each component $C \subset G_{p_{i}}$ , $p_{i}$ computes $idx (C) = \{idx (v)\}_{v \in C}$ . For each component C in $G_{p_{i}}$ , $p_{i}$ shares $idx (C)$ with $p_{3 - i}$ . Note that each party $p_{i}$ learns the indices of its own vertices in $π (\vec{L})$ and it learns the indices corresponding to each of $p_{3 - i}$ ’s components in $π (\vec{L})$ . However, neither party learns the original positions of either party’s vertices in $\vec{L}$ . In the proof, we show that revealing these indices in $π (\vec{L})$ reveals no information about the other party’s inputs, assuming parties pad their graphs to uniform structure as in Section 4.4.1 (otherwise, parties learn each other’s graph structure).

Garbled circuit part 2: Percolate and match. In the second subcircuit, the parties perform percolation and matching. The parties use the information revealed about their components’ indices in $π (\vec{L})$ to hard-wire the indices of each component in order to perform percolation. Percolation happens via an independent subcircuit for each component in both parties’ graphs. For each component C, let $idx (C)$ be the indices of the component’s vertices in $π (\vec{L})$ . The circuit computes ${lbl}^{*} \leftarrow {min}_{j \in idx (C)} π (\vec{L}) [j] . lbl$ , and then assigns $π (\vec{L}) [j] . lbl \leftarrow {lbl}^{*}$ for each $j \in idx (C)$ .

Given a circuit with the parties’ descriptors arranged as $π (\vec{L})$ , the circuit applies $π^{- 1} = π_{p_{2}}^{- 1} \circ π_{p_{1}}^{- 1}$ to $π (\vec{L})$ to retrieve $\vec{L}$ . Note that this returns the order of the vertices to the initial (merged) ordering in Part 1; specifically, the sorting order does not change during the percolate-and-merge steps, except to permute back and forth as required. After returning to the original merged ordering, the circuit then performs matching by iterating through $\vec{L}$ and obliviously comparing the descriptors at each pair of adjacent indices in $\vec{L}$ .4

⁴

If not using equality to compare the descriptors, then one could substitute any other comparison circuit to evaluate matching between two elements.

Let $\vec{L} [i] = ({img}_{i}, {lbl}_{i}, {party}_{i})$ be the descriptor at index i in $\vec{L}$ , and let $\vec{L} [i + 1] = ({img}_{i + 1}, {lbl}_{i + 1}, {party}_{i + 1})$ be the descriptor at index $i + 1$ . If ${img}_{i} = {img}_{i + 1}$ , then both ${lbl}_{i}$ and ${lbl}_{i + 1}$ are set to be the minimum among ${lbl}_{i}$ and ${lbl}_{i + 1}$ .

The circuit iterates between percolation and matching $m - 1$ times, applying π to transition from matching to percolation, and $π^{- 1}$ to transition from percolation to matching. After the final matching, the circuit applies π to transition to the output phase.
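To make the interplay of π, $π^{- 1}$ , and the hard-wired index sets concrete, the sketch below walks through the same sequence of steps in the clear. It is a plaintext illustration under simplifying assumptions: `run_circuit_in_clear` is a hypothetical name, a seeded pseudo-random shuffle stands in for the jointly chosen permutation, components are not padded, and nothing is hidden from anyone.

```python
import heapq
import random

def run_circuit_in_clear(sorted_1, sorted_2, comps_1, comps_2, m, seed=0):
    """Plaintext walk-through of merge, permute, percolate, unpermute, match.

    sorted_i: party i's descriptors [img, lbl, party], sorted by img.
    comps_i:  party i's components, each a set of images.
    m:        agreed maximum component size.
    """
    merged = [list(d) for d in heapq.merge(sorted_1, sorted_2, key=lambda d: d[0])]
    n = len(merged)

    # pi sends position i of the merged list to position pi[i].
    pi = list(range(n))
    random.Random(seed).shuffle(pi)
    inverse = [0] * n
    for i, target in enumerate(pi):
        inverse[target] = i

    def apply(perm, items):
        out = [None] * n
        for i, item in enumerate(items):
            out[perm[i]] = item
        return out

    # Index sets of every component in the permuted list (the revealed idx(C)).
    permuted = apply(pi, merged)
    index_sets = []
    for party, comps in ((1, comps_1), (2, comps_2)):
        for comp in comps:
            index_sets.append([j for j, d in enumerate(permuted)
                               if d[2] == party and d[0] in comp])

    for _ in range(m - 1):
        permuted = apply(pi, merged)
        for idx in index_sets:                      # percolation by fixed indices
            smallest = min(permuted[j][1] for j in idx)
            for j in idx:
                permuted[j][1] = smallest
        merged = apply(inverse, permuted)           # back to sorted order
        for a, b in zip(merged, merged[1:]):        # matching on adjacent pairs
            if a[0] == b[0]:
                a[1] = b[1] = min(a[1], b[1])
    # Final application of pi before the output phase.
    return apply(pi, merged), index_sets
```

Percolation touches only positions listed in `index_sets`, which are fixed before the loop starts; this is the property that lets the real circuit hard-wire those positions instead of performing input-dependent lookups.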

Encrypting vertex labels. At the end of the protocol, each party must receive its vertex labels encrypted under a key known only to the other party. We show how to move encryption outside of the garbled circuit in order to save the cost of online encryption at the expense of a few extra rounds of communication.

For a generic CPA-secure encryption scheme, we use the following technique. Consider a message m, computed within a circuit, that needs to be encrypted under a key k known only to $p_{1}$ without either party learning m. At the end, $p_{2}$ should learn $c = CPA . enc (k, m)$ . We can encrypt m under k as follows. $p_{2}$ samples a one-time pad υ and submits it to the circuit. The circuit outputs $m \oplus υ$ to $p_{1}$ . $p_{1}$ computes $c^{'} = CPA . enc (k, m \oplus υ)$ . Then, $p_{1}$ sends $c^{'}$ to $p_{2}$ , and $p_{2}$ computes $c = CPA . xor (c^{'}, υ) = CPA . enc (k, m)$ .
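A minimal sketch of this message flow is given below. The encryption scheme shown, a random nonce together with the message XORed with an HMAC-SHA256 keystream, is our own stand-in chosen only because it has the required XOR homomorphism; it is not claimed to be the scheme used in the protocol, and all function names are hypothetical. Messages are limited to 32 bytes in this toy version.

```python
import hashlib
import hmac
import secrets

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def enc(key, message):
    """Toy XOR-homomorphic encryption: (nonce, message XOR PRF(key, nonce))."""
    nonce = secrets.token_bytes(16)
    stream = hmac.new(key, nonce, hashlib.sha256).digest()[:len(message)]
    return nonce, xor_bytes(message, stream)

def dec(key, ciphertext):
    nonce, body = ciphertext
    stream = hmac.new(key, nonce, hashlib.sha256).digest()[:len(body)]
    return xor_bytes(body, stream)

def ct_xor(ciphertext, pad):
    """Homomorphic XOR: turns an encryption of x into one of x XOR pad."""
    nonce, body = ciphertext
    return nonce, xor_bytes(body, pad)

def encrypt_outside_circuit(key_p1, label, pad_p2):
    """Follow the message flow described above for a single label."""
    masked = xor_bytes(label, pad_p2)   # circuit output, seen only by p1
    c_prime = enc(key_p1, masked)       # computed by p1, sent to p2
    return ct_xor(c_prime, pad_p2)      # computed by p2; encrypts the label
```

In this flow $p_{1}$ only ever handles the masked label and $p_{2}$ only ever handles ciphertexts under a key it does not hold, so the encryption can be done outside the circuit.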

The garbled circuit produces outputs to the parties as follows. Let ${out}_{i}$ be the set of indices of $p_{i}$ ’s vertices in $π (\vec{L})$ . For the jth index in ${out}_{i}$ , $p_{3 - i}$ receives $π (\vec{L}) [{out}_{i} [j]] . lbl \oplus {\vec{ρ}}_{p_{i}} [j]$ , which is $p_{i}$ ’s jth label masked by $p_{i}$ ’s jth random pad. The parties use the technique above to recover their encrypted labels.

Interfacing with DDH-style PSI protocols. We present our technique for ElGamal encrypting vertex labels. As we show below, DDH-style PSI protocols can be modified to accept ElGamal-encrypted inputs.

First, the parties agree on ℓ group elements $\{h_{j, 0}\}_{j \in [ℓ]}$ . They then compute $h_{j, 1} = h_{j, 0}^{- 1}$ as the inverse of each element. Neither party should know the discrete logarithm of these group elements; this can be accomplished, for example, by using the Diffie-Hellman key exchange to agree on $h_{j, 0}$ for each j.

We represent a label m using these group elements by letting each pair of group elements $(h_{j, 0}, h_{j, 1})$ correspond to the possible values of the bit at position j of the label (the same group elements are used for all labels). Let $m_{j}$ be the jth bit of m. m can be represented as $\{h_{j, m_{j}}\}_{j \in [| m |]}$ . Notice that each bit can be inverted by computing $h_{j, b}^{q - 1}$ where q is the order of the group, since $h^{q} = 1$ for every group element h. Therefore, to compute $m \oplus p$ , it is sufficient to invert the elements representing m where the bits of p are 1.

Suppose m is the label of one of $p_{i}$ ’s vertices. As we described above, $p_{i}$ submits a mask υ to the circuit, and the circuit outputs $m^{'} = m \oplus υ$ to $p_{3 - i}$ . For each bit $m_{j}^{'}$ in $m^{'}$ , $p_{3 - i}$ will encrypt $h_{j, m_{j}^{'}}$ under its public key, and will send the ℓ ciphertexts for $m^{'}$ to $p_{i}$ . When $p_{i}$ receives its ciphertexts, it removes υ by implementing $ElGl . Xor$ as follows. For each bit $υ_{j}$ of υ where $υ_{j} = 1$ , $p_{i}$ uses $ElGl . Mul$ to invert the plaintext of the corresponding bit ciphertext. Finally, $p_{i}$ uses $ElGl . Mul$ to combine the bit-ciphertexts into a single label.
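The bitwise encoding and unmasking can be checked on a toy example. The sketch below uses textbook ElGamal over a deliberately tiny group (the order-1019 subgroup of the integers modulo 2039), which offers no security and is an assumption made purely so that the numbers are easy to follow. The function names and the 8-bit label length are likewise hypothetical, and plaintext inversion is done here with a modular inverse of both ciphertext components rather than by repeated $ElGl . Mul$ .

```python
import random

P, Q, G = 2039, 1019, 4   # toy safe-prime group: P = 2Q + 1, G has order Q
ELL = 8                   # label length in bits (assumed for the sketch)

def keygen(rng):
    sk = rng.randrange(1, Q)
    return sk, pow(G, sk, P)

def enc(pk, m, rng):
    r = rng.randrange(1, Q)
    return pow(G, r, P), (m * pow(pk, r, P)) % P

def dec(sk, c):
    c1, c2 = c
    return (c2 * pow(c1, -sk, P)) % P

def mul(a, b):
    """ElGamal is multiplicatively homomorphic: multiply componentwise."""
    return (a[0] * b[0]) % P, (a[1] * b[1]) % P

def invert(c):
    """Encryption of the inverse plaintext (inverts both components)."""
    return pow(c[0], -1, P), pow(c[1], -1, P)

def bits(x):
    return [(x >> j) & 1 for j in range(ELL)]

def encode(label, h0):
    """Product of h_{j, m_j}, with h_{j,1} the inverse of h_{j,0}."""
    acc = 1
    for j, b in enumerate(bits(label)):
        acc = (acc * (pow(h0[j], -1, P) if b else h0[j])) % P
    return acc

def unmask_and_combine(bit_cts, mask):
    """p_i's step: flip the bits where the mask is 1, then multiply."""
    acc = (1, 1)  # trivial encryption of the identity element
    for ct, u in zip(bit_cts, bits(mask)):
        acc = mul(acc, invert(ct) if u else ct)
    return acc
```

Decrypting the combined ciphertext yields the same group element as encoding the unmasked label directly, which is what lets the result be fed into a DDH-style protocol. The modular inverses rely on Python 3.8 or later, where `pow` accepts a negative exponent together with a modulus.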

Recall that DDH-style PSI protocols proceed as follows:

$p_{1}$ chooses a random exponent $R_{1}$ and, for each element $S_{1, i}$ in its set, sends $S_{1, i}^{R_{1}}$ to $p_{2}$ .

$p_{2}$ chooses a random exponent $R_{2}$ and, for each element $S_{2, j}$ in its set, computes $S_{2, j}^{R_{2}}$ . It then computes ${(S_{1, i}^{R_{1}})}^{R_{2}}$ , and sends $\{S_{1, i}^{R_{1} R_{2}}\}$ and $\{S_{2, j}^{R_{2}}\}$ to $p_{1}$ .

$p_{1}$ computes ${(S_{2, j}^{R_{2}})}^{R_{1}}$ and the intersection.
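The three steps above can be sketched as follows, again over a toy group that is far too small to be secure. The name `ddh_psi` and the encoding `to_group`, which maps a small integer x to $g^{x}$ , are assumptions for illustration; a deployed protocol would hash items into a group in which discrete logarithms are unknown.

```python
import random

P, Q, G = 2039, 1019, 4  # toy group of prime order Q inside the integers mod P

def to_group(item):
    """Toy injective encoding of an integer item in [1, Q - 1]."""
    if not 1 <= item < Q:
        raise ValueError("item outside toy encoding range")
    return pow(G, item, P)

def ddh_psi(set_1, set_2, rng):
    """Return the intersection as p1 would compute it."""
    items_1 = list(set_1)
    r1 = rng.randrange(1, Q)   # p1's secret exponent R1
    r2 = rng.randrange(1, Q)   # p2's secret exponent R2

    # Step 1: p1 sends its elements raised to R1.
    msg_1 = [pow(to_group(x), r1, P) for x in items_1]

    # Step 2: p2 raises p1's values to R2 (keeping their order) and sends
    # them back together with its own elements raised to R2.
    double_1 = [pow(v, r2, P) for v in msg_1]
    single_2 = [pow(to_group(y), r2, P) for y in set_2]

    # Step 3: p1 raises p2's values to R1 and compares doubly-masked values.
    double_2 = {pow(v, r1, P) for v in single_2}
    return {x for x, v in zip(items_1, double_1) if v in double_2}
```

Since exponentiation by a nonzero exponent permutes a prime-order group, two doubly-masked values agree exactly when the underlying items agree.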

We note that the exponentiations of each party’s input elements can actually be performed using $ElGl . Mul$ . Since the other party has the key to decrypt these ciphertexts, the protocol proceeds as follows:

$p_{1}$ samples a random exponent $R_{1}$ and, for each element ${enc}_{{pk}_{2}} (S_{1, i})$ in its set, sends ${enc}_{{pk}_{2}} (S_{1, i}^{R_{1}})$ to $p_{2}$ .

$p_{2}$ samples a random exponent $R_{2}$ and, for each element ${enc}_{{pk}_{1}} (S_{2, j})$ in its set, computes ${enc}_{{pk}_{1}} (S_{2, j}^{R_{2}})$ . $p_{2}$ computes $S_{1, i}^{R_{1}} \leftarrow {dec}_{{sk}_{2}} ({enc}_{{pk}_{2}} (S_{1, i}^{R_{1}}))$ and then ${(S_{1, i}^{R_{1}})}^{R_{2}}$ , and sends $\{S_{1, i}^{R_{1} R_{2}}\}$ and $\{{enc}_{{pk}_{1}} (S_{2, j}^{R_{2}})\}$ to $p_{1}$ .

$p_{1}$ decrypts, computes ${(S_{2, j}^{R_{2}})}^{R_{1}}$ , and computes the intersection.

We require that our ElGamal transformation maintains uniqueness of labels. This is to say that there should be no two labels m, $m^{'}$ that map to the same group element under the transformation.

Recall that we represent a label m using group elements by letting each pair of group elements $(h_{j, 0}, h_{j, 1})$ correspond to the possible values of the bit at position j of m (the same group elements are used for all labels). Let $m_{j}$ be the jth bit of m. m can be represented as $\{h_{j, m_{j}}\}_{j \in [| m |]}$ . Recall as well that for all j, $h_{j, 0} = {(h_{j, 1})}^{- 1}$ .

To maintain uniqueness of labels under the transformation, we require that for all labels m, $m^{'} \neq m$ : $\prod_{j \in [ℓ]} h_{j, m_{j}} \neq \prod_{j \in [ℓ]} h_{j, m_{j}^{'}}$ .

More simply, we can define the event of a label collision as one in which two distinct labels are assigned the same group element. For labels m, $m^{'}$ , we define $I_{m, m^{'}} = \{i : m_{i} \neq m_{i}^{'}\}$ . Then a label collision occurs if and only if $\prod_{i \in I_{m, m^{'}}} h_{i, m_{i}} = \prod_{i \in I_{m, m^{'}}} h_{i, m_{i}^{'}}$ , or equivalently, $\frac{\prod_{i \in I_{m, m^{'}}} h_{i, m_{i}}}{\prod_{i \in I_{m, m^{'}}} h_{i, m_{i}^{'}}} = 1$ , and because $h_{i, m_{i}}$ is the inverse of $h_{i, m_{i}^{'}}$ by definition, the requirement is equivalently $\prod_{i \in I_{m, m^{'}}} {(h_{i, m_{i}})}^{2} = 1$ .

In any multiplicative group G of prime order q, there are only two unique elements $h \in G$ for which $h^{2} = 1$ (these are 1 and $- 1 \bmod q$ ). The probability that any two labels m, $m^{'} \neq m$ map to the same group element is therefore bounded by the probability that $\prod_{i \in I_{m, m^{'}}} h_{i, m_{i}} \in \{\pm 1 \bmod q\}$ . When all $h_{i, 0}$ are selected at random, this is exactly $\frac{2}{q}$ (and it is independent of the size of $I_{m, m^{'}}$ if any $h_{i, 0}$ is selected at random). By a union bound, the probability that there exist any two labels that map to the same group element is upper bounded by $\frac{2 C^{2}}{q}$ , where C is the number of unique components in $G_{p_{1}} \cup G_{p_{2}}$ . Therefore, the size of the group used for the technique can be set so that this probability is small.

We note that the size of the group must already be large enough that computing discrete logarithms is hard, and that the probability of guessing correctly is negligible in the security parameter. Our upper bound on the probability of a collision is a factor of $2 C^{2}$ larger than the probability of correctly guessing a discrete log, which for appropriate parameters is still negligible in the security parameter.
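As a worked instance of this bound, suppose (purely as illustrative values, not parameters taken from the protocol) that there are $C = 2^{30}$ components and that the group order satisfies $q \approx 2^{256}$ . Then the collision probability is at most:

```latex
\frac{2 C^{2}}{q} \approx \frac{2 \cdot (2^{30})^{2}}{2^{256}} = \frac{2^{61}}{2^{256}} = 2^{-195}
```

Even for a very large number of components, the bound therefore stays far below any practical failure threshold under these assumed parameters.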

Protocol outputs. Each party outputs two sets of encrypted labels. First, each party outputs the other party’s masked labels, which it receives from the garbled circuit. Second, it outputs its own vertices’ encrypted labels.

Parties associate their encrypted labels with their vertices based on the order in which they receive their encrypted labels. In the earlier reveal phase, the parties learn the indices of their own vertices in the permuted list. They sort their vertices based on their indices in that list, and then associate the sorted vertices in order with the encrypted labels they receive. To choose a component label, a party arbitrarily selects any label assigned to a vertex in the component.

4.5. Termination of percolate-and-match

Given two parties’ graphs, $G_{p_{1}}$ and $G_{p_{2}}$ and their union $G = G_{p_{1}} \cup G_{p_{2}}$ , we now prove how many iterations of the percolate-and-match algorithm are required until vertex labels stabilize in both parties’ graphs. We will show that if m is the maximum size of any connected component in G, then percolate-and-match stabilizes in at most $m - 1$ iterations.

Theorem 1 (Termination of Percolate-and-Match).

Let V be a set of vertices, let $G_{p_{1}} = (V_{1}, E_{1})$ , $G_{p_{2}} = (V_{2}, E_{2})$ such that $V_{1} \subset V$ and $V_{2} \subset V$ , and let $G = G_{p_{1}} \cup G_{p_{2}}$ . If the maximum size (by number of vertices) of a connected component in G is m, then in the worst case, $m - 1$ iterations of the percolate-and-match algorithm are both necessary and sufficient for vertex labels to stabilize.

Proof.
We show that vertex labels stabilize after $m - 1$ iterations, where m is a parameter denoting the largest component (by number of vertices) in $G = G_{p_{1}} \cup G_{p_{2}}$ . We do so by analyzing how labels percolate between vertices during each iteration. If at the beginning of an iteration, the label of some vertex $u \in V_{p_{i}}$ is $lbl$ , and at the end of the iteration, the label of u is ${lbl}^{'}$ , then we say ${lbl}^{'}$ reaches u during the iteration.

Consider any component $C \subset G$ . Let v be the vertex in C with the minimum label in either party’s initial labeling, and let its label be ${lbl}^{*}$ . We show that in each iteration until labels stabilize, ${lbl}^{*}$ must reach at least one new vertex in both $G_{p_{1}}$ and $G_{p_{2}}$ . As m denotes the maximum number of vertices in a component, ${lbl}^{*}$ reaches every vertex in C in at most $m - 1$ iterations.

If in any percolation phase, ${lbl}^{*}$ does not reach a new vertex in either party’s graph (meaning it does not reach a new vertex in $G_{p_{1}}$ and it does not reach a new vertex in $G_{p_{2}}$ ), then the label has finished percolating because it cannot match to a new vertex in the following matching phase. Similarly, if during any matching phase, ${lbl}^{*}$ does not match from a vertex in one party’s graph to a vertex in the other party’s graph which is not yet labeled ${lbl}^{*}$ , then the label has finished percolating because it reaches no new components in the phase. Therefore, if there is ever an iteration of the algorithm in which ${lbl}^{*}$ does not reach a new vertex in either party’s graph, then iteration is complete. By contrapositive, while iteration is not complete, ${lbl}^{*}$ must reach at least one new vertex in some party’s graph during each percolation phase, and in each matching phase it must match to the same vertex in the other party’s graph.

Fig. 10.
Worst case example for the number of iterations until labels stabilize. $v_{1}$ , $v_{2}$ , $v_{3}$ , and $v_{4}$ are in both $G_{p_{1}}$ and $G_{p_{2}}$ . Solid lines represent edges in $G_{p_{1}}$ or $G_{p_{2}}$ . Dotted lines represent matches during the matching phase. Three iterations are required to percolate a label from $v_{1}$ to $v_{4}$ in $p_{1}$ ’s graph.

In Fig. 10, we provide an example of a graph in which ${lbl}^{*}$ reaches only one new vertex in each iteration. This completes the proof, as it shows that $m - 1$ iterations are required for some graphs in which the maximum component size is m. One could analogously construct an example for any value of m. At each percolation phase, ${lbl}^{*}$ reaches one new vertex, and in the following matching phase the label matches to a new sub-component in the other party’s graph. The algorithm therefore takes $m - 1$ iterations to percolate a label to all m vertices in C. □
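The worst case can be reproduced numerically. The sketch below assumes a path $v_{1}, \ldots, v_{m}$ held by both parties whose edges alternate between the two graphs, which is our reading of the layout suggested by the caption of Fig. 10, and counts the iterations in which some label changes. The function name is hypothetical.

```python
def iterations_on_alternating_path(m):
    """Count label-changing iterations on a path v_1..v_m whose edges
    alternate between the two parties (both parties hold every vertex)."""
    # Edge (i, i+1) belongs to party 2 when i is odd, to party 1 when even.
    comps = {1: [], 2: []}
    for party, first in ((2, 1), (1, 2)):
        covered = set()
        for i in range(first, m, 2):
            comps[party].append([i, i + 1])
            covered.update((i, i + 1))
        # Vertices without an edge in this party's graph are singletons.
        comps[party] += [[v] for v in range(1, m + 1) if v not in covered]

    labels = {1: {v: v for v in range(1, m + 1)},
              2: {v: v for v in range(1, m + 1)}}
    iterations = 0
    while True:
        before = (dict(labels[1]), dict(labels[2]))
        for party in (1, 2):                       # percolation phase
            for comp in comps[party]:
                smallest = min(labels[party][v] for v in comp)
                for v in comp:
                    labels[party][v] = smallest
        for v in range(1, m + 1):                  # matching phase
            labels[1][v] = labels[2][v] = min(labels[1][v], labels[2][v])
        if (labels[1], labels[2]) == before:
            return iterations
        iterations += 1
```

For the four-vertex case this returns three, in line with the caption of Fig. 10, and for each m that we tried it returns $m - 1$ .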
5. Proof of security

Our protocol for component labeling achieves security in the honest-but-curious model. We write the proof in a hybrid model in which the parties have access to a functionality $F^{GC}$ that takes the place of their garbled circuit evaluations. As described in Fig. 4, $F^{GC}$ takes the description of a circuit c and two parties’ inputs and it returns the evaluation of c on those inputs to the parties.

We denote by $F^{lbl} = (F_{1}^{lbl}, F_{2}^{lbl})$ the two-party component labeling functionality. Recall that $F_{i}^{lbl}$ is a pair $(f_{p_{3 - i}}^{mask} (x, y), f_{p_{i}}^{enc} (x, y))$ . We denote by Π our component labeling protocol. Denote by ${output}^{Π} = ({output}_{1}^{Π}, {output}_{2}^{Π})$ the pair of random variables describing each party’s output of a real execution of Π.

Let ${VIEW}_{p_{i}}^{Π} (x, y, λ)$ be the view of party $p_{i}$ in a real execution of Π when $p_{1}$ has input x, $p_{2}$ has input y, and λ is the security parameter. Recall that $x = (G_{p_{1}}, k_{p_{1}}^{enc}, {\vec{ρ}}_{p_{1}})$ and $y = (G_{p_{2}}, k_{p_{2}}^{enc}, {\vec{ρ}}_{p_{2}})$ . $p_{i}$ ’s view in an execution of Π is composed of its input, its internal randomness $r_{p_{i}}$ , and the messages it receives during the protocol. Before proceeding, we note that our construction fixes $k^{enc}$ and $\vec{ρ}$ as an input for each party. This is analogous to allowing these variables to be generated by each party during execution, and fixing the randomness in each party’s view used to generate these variables. Next, we state our theorem:

Theorem 2 (Security with respect to honest-but-curious adversaries).

In the $F^{GC}$ -hybrid model, there exist PPT simulators $S_{1}$ and $S_{2}$ such that for all inputs x, y and security parameter λ: $\begin{array}{ll} \{({VIEW}_{p_{1}}^{Π} (x, y, λ), {output}^{Π} (x, y, λ))\}_{x, y, λ} & \approx \{(S_{1} (1^{λ}, x, F_{1}^{lbl} (x, y)), F^{lbl} (x, y))\}_{x, y, λ} \\ \{({VIEW}_{p_{2}}^{Π} (x, y, λ), {output}^{Π} (x, y, λ))\}_{x, y, λ} & \approx \{(S_{2} (1^{λ}, y, F_{2}^{lbl} (x, y)), F^{lbl} (x, y))\}_{x, y, λ} \end{array}$

Proof.
We describe how $S_{1}$ simulates the view of $p_{1}$ ; $S_{2}$ is analogous. At a high level, $S_{1}$ generates a view by randomly sampling an input graph for $p_{2}$ , and then faithfully simulating the interaction until the last step. For the last step $S_{1}$ uses its knowledge of $p_{1}$ ’s ideal-functionality outputs to ensure that $p_{1}$ ’s simulated view implies the correct outputs, and that the final messages sent by $p_{1}$ are consistent with the honest $p_{2}$ ’s outputs.

Intuitively, the strategy works for $S_{1}$ for the following reason. $p_{1}$ learns two kinds of information from the messages it receives. First, it learns encryptions of its own and the other party’s vertex labels. For these messages, $S_{1}$ generates encryptions of junk which are indistinguishable from $p_{1}$ ’s real messages. Second, $p_{1}$ learns information about the ordering of the images of $p_{1}$ and $p_{2}$ ’s vertices under a permutation π, which is defined by composing $π_{p_{1}}$ and $π_{p_{2}}$ , the permutations selected by the honest parties. Because each honest party randomly selects its permutation $π_{p_{i}}$ , π is a random permutation as long as one participant is honest. Therefore, the permuted ordering is distributed identically to the ordering of the images of $p_{1}$ ’s vertices and $S_{1}$ ’s dummy vertices when $S_{1}$ selects its own random permutation $π_{S_{1}}$ and the images are permuted by $π^{'} = π_{p_{1}} \circ π_{S_{1}}$ .

The description of $S_{1}$ follows:
Random Tapes: $S_{1}$ uniformly samples $r_{p_{1}}$ as its random tape for the simulation of $p_{1}$ and $r_{p_{2}}$ as its random tape for the emulation of $p_{2}$ . (Each party’s tape must be long enough to provide randomness for every encryption that it must compute and all of the one-time pads it must generate.)

Simulated inputs for $p_{2}$ : $S_{1}$ randomly samples an input graph $G_{S}$ that it uses as input for $p_{2}$ . It samples $G_{S}$ by sampling N vertices $V_{S} \leftarrow V$ , and then randomly partitioning $V_{S}$ into components of size m subject to the constraint that in the graph $G = G_{p_{1}} \cup G_{S}$ , there are no components of size larger than m. $S_{1}$ does not randomly sample an encryption key to serve as $k_{p_{2}}^{enc}$ and it does not sample one time pads to serve as replacements for ${\vec{ρ}}_{p_{2}}$ .

Adversary’s Choice of Label Functionality: $S_{1}$ learns the labeling function Λ chosen by the adversary in place of the ideal functionality.

Honest Execution: For every step of the protocol except for those in which $p_{1}$ receives its protocol outputs, $S_{1}$ faithfully emulates the behavior of $p_{1}$ and $p_{2}$ in interaction with each other using inputs x for $p_{1}$ , $G_{S}$ as $p_{2}$ ’s graph, and $r_{p_{1}}$ and $r_{p_{2}}$ as the parties’ internal random tapes.

Fixing $p_{1}$ ’s outputs: $S_{1}$ deviates from its strategy of faithfully emulating the execution of $p_{1}$ and $p_{2}$ in order to ensure that the view constructed for $p_{1}$ is consistent with the ideal-functionality output of $p_{1}$ . Recall that $F_{1}^{lbl} (x, y) = (f_{p_{2}}^{mask} (x, y), f_{p_{1}}^{enc} (x, y))$ . $S_{1}$ fixes the messages received by $p_{1}$ as follows:

$f_{p_{2}}^{mask} (x, y)$ : $S_{1}$ provides $p_{1}$ with $f_{p_{2}}^{mask}$ , which is $p_{2}$ ’s ideal-functionality labels masked with random strings in $p_{2}$ ’s input y.

$f_{p_{1}}^{enc} (x, y)$ : $S_{1}$ masks the encrypted labels given by $F_{1}^{lbl}$ with ${\vec{ρ}}_{p_{1}}$ , which are the random strings in $p_{1}$ ’s input designated for masking its final labels. (Recall that in a real execution, $p_{1}$ submits these to the garbled circuit.) Let $e_{j}$ for $j \in [N]$ be the jth encrypted label given by $f_{p_{1}}^{enc}$ . $S_{1}$ provides $e_{j} \oplus ρ_{p_{1}, j}$ for each masked-and-encrypted label in $p_{1}$ ’s output.

We proceed to compare the distributions of messages that $p_{1}$ receives in a real execution with the distributions of messages that $S_{1}$ constructs for $p_{1}$ .
$p_{1}$ ’s first message: In a real execution, $p_{1}$ receives $\{o_{p_{i}, j}^{order}\}_{j \in [2 N]}$ . We divide these $2 N$ strings into two sets of N strings. In the first set, the garbled circuit returns the images of $p_{1}$ ’s vertices under H, randomly permuted. In the other set, $p_{1}$ receives the images of $p_{2}$ ’s vertices under H, randomly permuted and masked by one-time pads generated by $p_{2}$ . The first set in $\{o_{p_{i}, j}^{order}\}_{j \in [2 N]}$ reveals the indices of $p_{1}$ ’s vertex images when sorted with $p_{2}$ ’s vertex images and permuted by π.

In $S_{1}$ ’s generated view, the message received by $p_{1}$ is different in two ways. First, its vertices are permuted by some other random permutation $π^{'}$ ; second, the messages it receives for $p_{2}$ ’s vertices are the images of $V_{S}$ , masked by one-time pads generated by $S_{1}$ , rather than the images of $V_{p_{2}}$ masked by pads generated by $p_{2}$ .

$p_{1}$ ’s second message: In a real execution, for each component $C \in G_{p_{2}}$ , $p_{1}$ receives the indices of C in the permuted image of $V_{p_{1}} | | V_{p_{2}}$ . (Recall that each component is size m, so for each $C \in G_{p_{2}}$ , $p_{1}$ receives m unique indices in $[2 N]$ .) In the view generated by $S_{1}$ , for each component $C \in G_{S}$ , $p_{1}$ receives the indices of C in the permuted form of $V_{p_{1}} | | V_{S}$ .

$p_{1}$ ’s third message: In a real execution, $p_{1}$ invokes $F^{GC}$ to evaluate ${GC}^{Perc&Match}$ and receives the set of $p_{2}$ ’s output labels, masked by $p_{2}$ ’s one-time pads. $S_{1}$ does not invoke $F^{GC}$ , but simply provides $p_{1}$ with $f_{p_{2}}^{mask} (x, y)$ , which it learns from $p_{1}$ ’s ideal-functionality output. The difference between the messages in the real and simulated execution is that in a real execution, these labels are labels of the vertices in $V_{p_{1}}$ computed by $F^{GC}$ and masked with ${\vec{ρ}}_{p_{2}}$ , while in the view generated by $S_{1}$ , the labels are computed by $F^{lbl}$ using the adversary’s labeling function Λ and masked with ${\vec{ρ}}_{p_{2}}$ .

$p_{1}$ ’s fourth message: In a real execution, $p_{1}$ receives $\{ϕ_{p_{1}, j}\}_{j \in [N]}$ from $p_{2}$ , where each $ϕ_{p_{1}, j} = enc (k_{p_{2}}^{enc}, o_{j}) = enc (k_{p_{2}}^{enc}, {lbl}_{j} \oplus {\vec{ρ}}_{p_{1}, j}) = enc (k_{p_{2}}^{enc}, {lbl}_{j}) \oplus {\vec{ρ}}_{p_{1}, j}$ . $p_{1}$ removes its masks, after which it has N labels encrypted under $k_{p_{2}}^{enc}$ .

$S_{1}$ fixes this message to ensure that $p_{1}$ ’s simulated view is consistent with $f_{p_{1}}^{enc}$ , which it learns from $F^{lbl}$ . For $j \in [N]$ , $S_{1}$ provides $p_{1}$ with $e_{j} \oplus {\vec{ρ}}_{p_{1}, j}$ . (In a real execution, an honest $p_{1}$ would remove the masks ${\vec{ρ}}_{p_{1}, j}$ to derive its encrypted labels.)
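The identity $enc (k, {lbl} \oplus ρ) = enc (k, {lbl}) \oplus ρ$ used for the fourth message holds for encryption schemes that XOR the plaintext with a keystream, such as the CTR-mode instantiation mentioned in the evaluation. The Python sketch below illustrates this under loud assumptions: `keystream` is a toy SHA-256-based stand-in for a real cipher, the nonce is held fixed across both encryptions, and all function names are hypothetical.

```python
import hashlib
import secrets


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy counter-mode keystream from SHA-256 (illustration only, not AES-CTR)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def enc(key: bytes, nonce: bytes, msg: bytes) -> bytes:
    """Stream-cipher style encryption: ciphertext = msg XOR keystream."""
    return xor(msg, keystream(key, nonce, len(msg)))


key = secrets.token_bytes(16)      # stands in for p2's encryption key
nonce = secrets.token_bytes(12)
label = b"label-0001"              # a plaintext label
rho = secrets.token_bytes(len(label))  # p1's one-time pad for this label

phi = enc(key, nonce, xor(label, rho))  # p2 encrypts the masked label
unmasked = xor(phi, rho)                # p1 removes its own mask

# p1 ends up with an encryption of the unmasked label under p2's key.
assert unmasked == enc(key, nonce, label)
```

The same reasoning explains why the simulator can produce this message by XORing the encrypted labels it obtains from the ideal functionality with the pads in $p_{1}$ ’s input.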

To satisfy the definition of security, we must also show that the view generated by $S_{1}$ is consistent with the ideal functionality outputs of $p_{2}$ . Recall that $F_{2}^{lbl} (x, y) = (f_{p_{1}}^{mask} (x, y), f_{p_{2}}^{enc} (x, y))$ .
$f_{p_{2}}^{enc} (x, y)$ : The consistency of this output of $p_{2}$ with the view output by $S_{1}$ is implied by the fact that $S_{1}$ fixes $p_{1}$ ’s penultimate message to be exactly $f_{p_{2}}^{mask}$ , which it receives from the ideal functionality. Recall that in a real execution, $p_{1}$ encrypts the padded outputs it receives from the final garbled circuit using its encryption key $k_{p_{1}}^{enc}$ , and then $p_{2}$ removes the pads ${\vec{ρ}}_{p_{2}}$ from these encryptions after receiving them from $p_{1}$ . $p_{2}$ ’s output $f_{p_{2}}^{enc}$ is precisely the set of encryptions yielded by removing the pads ${\vec{ρ}}_{p_{2}}$ from the encryptions sent by $p_{1}$ .

Therefore, $S_{1}$ provides $p_{1}$ with precisely the masked pads that $p_{1}$ would then encrypt and send back to $p_{2}$ ; this is the message $f_{p_{2}}^{mask} (x, y)$ which $S_{1}$ receives from $F_{1}^{lbl}$ as part of $p_{1}$ ’s output and forwards to $p_{1}$ . The consistency of $p_{2}$ ’s ideal functionality output with the message sent by $p_{1}$ follows directly from the fact that the output of $p_{2}$ is a set of encryptions of the messages that $S_{1}$ forwards to $p_{1}$ from the ideal functionality.

$f_{p_{1}}^{mask} (x, y)$ : This is the set of labels of $p_{1}$ ’s vertices, masked with $p_{1}$ ’s one-time pads, which $p_{2}$ outputs. (Recall that in a real execution, $p_{2}$ receives these masked labels from the instance of $F^{GC}$ which computes ${GC}^{Perc&Match}$ ; it then encrypts them using its encryption key, and sends the encryptions to $p_{1}$ , who removes the masks homomorphically and outputs the encryptions.) This output must be consistent with:

the message that $p_{2}$ computes as a function of $f_{p_{1}}^{mask} (x, y)$ and sends to $p_{1}$ as $p_{1}$ ’s final message (specifically, this is the set of encryptions that $p_{2}$ computes and sends to $p_{1}$ in order to unmask and output), and

the garbled circuit inputs that are implied by $p_{1}$ ’s view.
Consistency with $p_{1}$ ’s final message: In order for $f_{p_{1}}^{mask} (x, y)$ to be consistent with $p_{1}$ ’s final message in the view generated by $S_{1}$ , it must be the case that the message $p_{1}$ receives is composed of encryptions of $f_{p_{1}}^{mask} (x, y)$ under $p_{2}$ ’s encryption key.

This is the case, since $S_{1}$ fixes $p_{1}$ ’s final message to be the pairwise XOR of $f_{p_{1}}^{enc}$ with ${\vec{ρ}}_{p_{1}}$ , where $f_{p_{1}}^{enc}$ is given to $S_{1}$ by $F_{1}^{lbl}$ , and $f_{p_{1}}^{enc}$ is defined to be the result of encrypting $f_{p_{1}}^{mask}$ and then homomorphically removing the pads ${\vec{ρ}}_{p_{1}}$ .

Consistency with GC Inputs: $f_{p_{1}}^{mask} (x, y)$ must be consistent with $p_{1}$ ’s inputs to the instance of $F^{GC}$ which computes ${GC}^{Perc&Match}$ : ${\vec{L}}_{p_{1}}$ and ${\vec{ρ}}_{p_{1}}$ . ${\vec{ρ}}_{p_{1}}$ is specified by $p_{1}$ ’s input, and ${\vec{L}}_{p_{1}}$ lists the images of $p_{1}$ ’s input vertices under the function H. Note that both of these inputs are independent of $p_{2}$ ’s inputs. The consistency of $p_{1}$ ’s view with $p_{2}$ ’s output therefore follows from the security of $F^{GC}$ .

We proceed with the full proof of security in the presence of a corrupt $p_{1}$ ; specifically, in the $F^{GC}$ -hybrid model, there exists a PPT simulator $S_{1}$ such that for all inputs x, y and security parameters λ: ${({VIEW}_{p_{1}}^{Π} (x, y, λ), {output}^{Π} (x, y, λ))}_{x, y, λ} \approx_{c} {(S_{1} (1^{λ}, x, F_{1}^{lbl} (x, y)), F^{lbl} (x, y))}_{x, y, λ}$ . The proof follows from a hybrid argument. In each hybrid we give a simulator that produces a random variable describing the view of $p_{1}$ in either a real or simulated execution.
${Hyb}_{0}$ : This is the view produced by a simulator $S$ that knows $p_{1}$ ’s and $p_{2}$ ’s inputs and faithfully executes the protocol on their behalf using their inputs. It is distributed identically to ${VIEW}_{p_{1}}^{Π} (x, y, λ)$ .

${Hyb}_{1}$ : This is identical to ${Hyb}_{0}$ , except for the order in which $p_{1}$ receives its outputs from $F^{GC}$ on the evaluation of ${GC}^{RevealOrder}$ and for the indices that it receives in its second message.

Order in first message: Recall that in a real execution, $p_{1}$ and $p_{2}$ each select $2 N$ one-time pads ${\vec{σ}}_{p_{i}}$ and Waksman selection bits $s_{p_{i}}$ and submit these along with their vertices to $F^{GC}$ in order to evaluate ${GC}^{RevealOrder}$ . In this hybrid, instead of delivering $p_{1}$ ’s output in the order corresponding to $F^{GC} ({GC}^{RevealOrder}, {\vec{L}}_{p_{1}}, {\vec{L}}_{p_{2}}, s_{p_{1}}, s_{p_{2}}, {\vec{σ}}_{p_{1}}, {\vec{σ}}_{p_{2}})$ (as in a real execution), $S$ randomly samples a graph $G_{S}$ , Waksman selection bits $s_{S}$ , and one-time pads ${\vec{σ}}_{S}$ that it uses as input for $p_{2}$ , and additionally invokes $F^{GC} ({GC}^{RevealOrder}, {\vec{L}}_{p_{1}}, {\vec{L}}_{S}, s_{p_{1}}, s_{S}, {\vec{σ}}_{p_{1}}, {\vec{σ}}_{S})$ . $S$ then changes the indices in which $p_{1}$ receives its own labels as if receiving output from $F^{GC} ({GC}^{RevealOrder}, {\vec{L}}_{p_{1}}, {\vec{L}}_{S}, s_{p_{1}}, s_{S}, {\vec{σ}}_{p_{1}}, {\vec{σ}}_{S})$ . (Note that $S$ does not actually use the outputs of $F^{GC} ({GC}^{RevealOrder}, {\vec{L}}_{p_{1}}, {\vec{L}}_{S}, s_{p_{1}}, s_{S}, {\vec{σ}}_{p_{1}}, {\vec{σ}}_{S})$ , but it delivers $p_{1}$ ’s outputs from $F^{GC} ({GC}^{RevealOrder}, {\vec{L}}_{p_{1}}, {\vec{L}}_{p_{2}}, s_{p_{1}}, s_{p_{2}}, {\vec{σ}}_{p_{1}}, {\vec{σ}}_{p_{2}})$ in the order defined by $F^{GC} ({GC}^{RevealOrder}, {\vec{L}}_{p_{1}}, {\vec{L}}_{S}, s_{p_{1}}, s_{S}, {\vec{σ}}_{p_{1}}, {\vec{σ}}_{S})$ .)

To better discuss the order in which labels are output, we define an n-ordering to be a list containing an n-sized subset of $\{1, \dots, 2 n\}$ . We define a procedure $O (π, l_{1}, l_{2})$ which takes a permutation π and two lists $l_{1}$ and $l_{2}$ such that $| l_{1} | = | l_{2} | = n$ and no list $l_{i}$ contains duplicate elements (although the two lists may have elements in common with each other). $O$ constructs $l_{3} = l_{1} | | l_{2}$ , and then sorts $l_{3}$ . Finally, it applies π to $l_{3}$ . $O$ returns an n-ordering which for each element of $l_{1}$ gives its index in $l_{3}$ .

Observe that the indices of $p_{1}$ ’s outputs of $F^{GC}$ in ${Hyb}_{0}$ correspond to $O (π, V_{1}, V_{2})$ and that the indices of $p_{1}$ ’s outputs of $F^{GC}$ in ${Hyb}_{1}$ correspond to $O (π^{'}, V_{1}, V_{S})$ , where $π = π_{1} \circ π_{2}$ and $π^{'} = π_{1} \circ π_{S}$ . Because both π and $π^{'}$ are random, the n-orderings describing $p_{1}$ ’s indices in ${Hyb}_{0}$ and ${Hyb}_{1}$ are independent of both the third argument to $O$ and the contents of $l_{1}$ and $l_{2}$ . It follows that $p_{1}$ ’s first message is identically distributed in ${Hyb}_{0}$ and ${Hyb}_{1}$ .
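To make the procedure $O$ and the independence claim concrete, the following Python sketch implements one reading of $O (π, l_{1}, l_{2})$ and checks exhaustively, for $n = 2$ , that the distribution of the resulting n-ordering over all permutations π does not depend on $l_{2}$ . The names `n_ordering` and `ordering_distribution`, the zero-based tuple encoding of π, and the tie-breaking rule for values shared by both lists are assumptions introduced for this illustration.

```python
from collections import Counter
from itertools import permutations


def n_ordering(pi, l1, l2):
    """One reading of O(pi, l1, l2): sort l1 || l2, apply pi, and report the
    1-based position of every element of l1."""
    n = len(l1)
    assert len(l2) == n and len(pi) == 2 * n
    # Tag elements with their list of origin so values that occur in both
    # lists remain distinguishable after sorting (assumed tie-breaking rule).
    tagged = [(v, 0, i) for i, v in enumerate(l1)]
    tagged += [(v, 1, i) for i, v in enumerate(l2)]
    tagged.sort()
    result = [0] * n
    for k, (_, source, i) in enumerate(tagged):
        if source == 0:
            result[i] = pi[k] + 1  # pi sends sorted position k to pi[k]
    return tuple(result)


def ordering_distribution(l1, l2):
    """Distribution of the n-ordering when pi is uniform over all (2n)! permutations."""
    n = len(l1)
    return Counter(n_ordering(pi, l1, l2) for pi in permutations(range(2 * n)))


v1 = [10, 40]
# Different second lists place v1's elements at different sorted positions,
# yet the distribution of the n-ordering is the same in every case.
reference = ordering_distribution(v1, [20, 30])
assert ordering_distribution(v1, [5, 50]) == reference
assert ordering_distribution(v1, [10, 40]) == reference
```

In this small case every ordered pair of distinct indices in $\{1, \dots, 4\}$ appears equally often, which matches the argument that a random π hides both the other list and the relative order of the two lists’ elements.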

Indices in second message: $p_{1}$ ’s second message contains random assignments of $p_{2}$ ’s indices into m-sized components. As we argued for $p_{1}$ ’s first message, the division of $p_{2}$ ’s indices into components is independent of $p_{2}$ ’s vertices because the permutation on $V_{p_{1}} | | V_{p_{2}}$ (or $V_{p_{1}} | | V_{S}$ ) induced by π (or $π^{'}$ ) is random. It follows that the message $p_{1}$ receives in ${Hyb}_{0}$ is distributed identically to the message that it receives in ${Hyb}_{1}$ . Thus, $p_{1}$ ’s second message is identically distributed in ${Hyb}_{0}$ and ${Hyb}_{1}$ , and ${Hyb}_{0}$ is distributed identically to ${Hyb}_{1}$ .

${Hyb}_{2}$ : This is identical to ${Hyb}_{1}$ , except for the outputs that $p_{1}$ receives from invoking $F^{GC}$ on ${GC}^{RevealOrder}$ .

$S$ invokes $F^{GC}$ on ${GC}^{RevealOrder}$ with $p_{1}$ ’s inputs and the dummy inputs that it generates for $p_{2}$ . In contrast to ${Hyb}_{1}$ , $S$ no longer discards the output of $F^{GC} ({GC}^{RevealOrder}, {\vec{L}}_{p_{1}}, {\vec{L}}_{S}, s_{p_{1}}, s_{S}, {\vec{σ}}_{p_{1}}, {\vec{σ}}_{S})$ but uses it in place of the output of $F^{GC} ({GC}^{RevealOrder}, {\vec{L}}_{p_{1}}, {\vec{L}}_{p_{2}}, s_{p_{1}}, s_{p_{2}}, {\vec{σ}}_{p_{1}}, {\vec{σ}}_{p_{2}})$ .

We divide our analysis of $p_{1}$ ’s first message in ${Hyb}_{1}$ and ${Hyb}_{2}$ into two parts. First we consider only those indices for which $p_{1}$ receives one of $p_{2}$ ’s masked labels. Second we consider only those indices for which $p_{1}$ receives its own labels.

The set of strings corresponding to $p_{2}$ ’s indices in ${Hyb}_{1}$ and in ${Hyb}_{2}$ are identically distributed, since both ${\vec{σ}}_{p_{2}}$ and ${\vec{σ}}_{S}$ are one-time pads sampled uniformly at random, and mask the labels output to $p_{1}$ which are not its own. The distributions of these masked strings that $p_{1}$ receives at $p_{2}$ ’s indices in ${Hyb}_{1}$ and ${Hyb}_{2}$ are both identical to the uniform random distribution.
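The fact that a label masked with a uniformly random one-time pad is itself uniformly distributed, whatever the label, can be verified exhaustively for short strings. The sketch below does so for 4-bit values; the function name `masked_distribution` and the integer encoding of bit strings are assumptions made for this illustration.

```python
from collections import Counter


def masked_distribution(label: int, bits: int) -> Counter:
    """Distribution of label XOR pad when pad ranges uniformly over all bit strings of length `bits`."""
    return Counter(label ^ pad for pad in range(1 << bits))


bits = 4
uniform = Counter(range(1 << bits))  # every value exactly once

# For every possible label, the masked value is uniform, so the masked
# strings carry no information about which label was masked.
assert all(masked_distribution(label, bits) == uniform for label in range(1 << bits))
```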

For all the indices for which $p_{1}$ receives one of its own labels, the set of strings it receives in ${Hyb}_{2}$ is distributed identically to the set that $p_{1}$ receives in ${Hyb}_{1}$ . In both cases it receives precisely the images of its own vertex labels under the function H. We conclude that ${Hyb}_{2}$ is distributed identically to ${Hyb}_{1}$ .

${Hyb}_{3}$ : This is identical to ${Hyb}_{2}$ , except that the third message received by $p_{1}$ is replaced by $f_{p_{2}}^{mask}$ , which $S$ learns from $p_{1}$ ’s ideal-functionality output. This takes the place of the output of $F^{GC} ({GC}^{Perc&Match}, \dots)$ , which $S$ does not invoke in this hybrid. In both ${Hyb}_{2}$ and ${Hyb}_{3}$ , $p_{1}$ receives the other party’s vertex labels masked by ${\vec{ρ}}_{p_{2}}$ . This message in ${Hyb}_{3}$ is distributed identically to the corresponding message in ${Hyb}_{2}$ if the labels generated by $F^{lbl}$ are identically distributed to the vertex labels constructed in a real execution. In a real execution, the parties invoke a collision-resistant hash function to compute the images of their vertices, and the labeling function computes the minimum label of any vertex image in a component. In an ideal execution, the parties again use a collision-resistant hash function, and the labeling function is provided by the adversary. The two distributions are identical if

there are no collisions of vertex images (preserving correctness), and

the labeling function applied by $F^{lbl}$ is the same as the one computed by $F^{GC}$ .

First, we note that the output space of the hash function is subject to the birthday bound, even in the best case when it is well-balanced. However, the output space can be chosen to be large enough that the probability of a collision is a negligible function of the security parameter. Second, we require that $F^{GC}$ correctly implement the labeling function Λ, and we note that in any protocol in which the adversary can specify the labeling function to the parties, they can hardwire the appropriate labeling function into the circuit which they compute. Therefore, for appropriately chosen parameters, ${Hyb}_{2}$ and ${Hyb}_{3}$ are computationally indistinguishable.
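For intuition, the minimum-label rule described above can be written as a cleartext reference computation: take the union of both parties’ components, and give every vertex the smallest hash image found in its component of the union. The Python sketch below is not the secure protocol (which computes the labels inside a garbled circuit); the function name `min_labels`, the union-find helper, and the example images are hypothetical.

```python
def min_labels(components_p1, components_p2, image):
    """Cleartext reference: label each vertex with the minimum image in its
    connected component of the union of both parties' graphs."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Vertices that share a component in either party's input are connected.
    for component in list(components_p1) + list(components_p2):
        for v in component:
            parent.setdefault(v, v)
        for v in component[1:]:
            union(component[0], v)

    # Record the smallest image seen in each merged component.
    smallest = {}
    for v in list(parent):
        root = find(v)
        smallest[root] = min(smallest.get(root, image(v)), image(v))
    return {v: smallest[find(v)] for v in parent}


# Hypothetical images standing in for H(vertex).
images = {"a": 5, "b": 3, "c": 9, "d": 1}
labels = min_labels([["a", "b"], ["c"]], [["b", "c"], ["d"]], images.__getitem__)
assert labels == {"a": 3, "b": 3, "c": 3, "d": 1}
```

If two unconnected vertices had colliding images, two distinct components could receive the same label, which is exactly the correctness failure that the first condition rules out.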

${Hyb}_{4}$ : This is identical to ${Hyb}_{3}$ , except that the fourth message received by $p_{1}$ is replaced by the element-wise XOR of $f_{p_{1}}^{enc}$ and ${\vec{ρ}}_{p_{1}}$ . In both ${Hyb}_{3}$ and ${Hyb}_{4}$ , $p_{1}$ receives encryptions of its output labels, masked with ${\vec{ρ}}_{p_{1}}$ . After removing the pad from each using ${\vec{ρ}}_{p_{1}}$ , $p_{1}$ holds N encryptions of plaintext labels under $k_{p_{2}}^{enc}$ . By the semantic security of the encryption scheme, these are computationally indistinguishable.

${Hyb}_{4}$ is identical to $S_{1} (1^{λ}, x, F_{1}^{lbl} (x, y))$ ; therefore, as we have already shown that the view output by $S_{1}$ is consistent with the ideal functionality output, we conclude that ${(S_{1} (1^{λ}, x, F_{1}^{lbl} (x, y)), F^{lbl} (x, y))}_{x, y} \approx_{c} {({VIEW}_{p_{1}}^{Π} (x, y, λ), {output}^{Π} (x, y))}_{x, y}$ . The simulator and analysis for a corrupt $p_{2}$ are analogous. This concludes the proof. □

6. Evaluation

6.1. Asymptotic analysis

The offline cost of the protocol is dominated by the setup and encryption phases. In the setup, sorting a list of N vertices offline requires $O (N log N)$ comparisons. During the encryption phase, each party encrypts the other party’s N labels and performs N XOR operations to retrieve its own encrypted labels.

The garbled circuit performs the following computations for the percolate-and-match algorithm. Merging two sorted lists of size N requires $O (N log N)$ oblivious comparisons using a Batcher merge. Each percolation phase requires computing C (the number of components) min-circuits over m-sized lists. We can find the minimum of a list with m elements using m comparisons; therefore, in total the min circuits require $N = C m$ comparisons per percolation phase. To perform each matching phase, we require $O (N)$ pairwise comparisons and updates. In addition, each Waksman network requires $O (N log N)$ oblivious swaps, and two permutation networks are computed per iteration. Therefore, each iteration of the loop requires $O (N log N)$ operations, and the iterative procedure loops $m - 1$ times. In total, the garbled circuit performs $O (N m log N)$ comparisons and swaps. The circuit must also compute $2 N$ conditional XOR operations for the first output to the two parties, and an additional N XORs for the final output.

The total cost of the protocol is dominated by the garbled circuit. The circuit size depends on the output length ℓ of the hash function H because each comparison is performed over ℓ-bit values. The total cost of the circuit is therefore $O (N m ℓ log N)$ gates. In Section 6.3, we show how to set ℓ as a function of the input size N and the tolerable correctness error ε. Specifically, we set $ℓ ⩾ ⌈ 2 log (2 N) - log (ε) - 1 ⌉$ , making the total size of the circuit $O (N m log (N) (log (N) + log (\frac{1}{ε})))$ gates.

6.2. Experiments

We implemented our protocol using Obliv-C [35] and Absentminded Crypto Kit [11]. We modified Obliv-C to send batches of 500 gates at a time, rather than sending each gate as soon as it is ready; from this optimization we observed a 50% speedup. Our tests were performed in parallel on Google Cloud Platform (GCP) on n1-highmem-32 (32 vCPUs with 208 GB memory) machines between pairs of machines in the same datacenter. El-Gamal operations were performed over elliptic curve secp256r1.

For each problem size, we ran our tests for three label lengths, each length corresponding to a parameterization of the correctness error, as explained in Section 6.3; specifically, we ran tests for parameters that bound the correctness error probability at $2^{- 40}$ , $2^{- 60}$ , and $2^{- 80}$ . For each problem size N and correctness parameter ε, we computed the requisite bit length $ℓ = ℓ (N, ε)$ and then rounded up the number of bits per vertex label to the next full byte (this is exactly the number of circuit wires per vertex label). We then instantiated the function H by truncating SHA256 outputs to the first ℓ bits.
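One way to realize the truncated hash described above is sketched below in Python. The function names `truncated_hash` and `wire_count` are hypothetical, and taking the "first ℓ bits" to mean the most significant bits of the big-endian digest is an assumption; the prototype’s exact bit ordering is not specified here.

```python
import hashlib


def truncated_hash(vertex: bytes, ell: int) -> int:
    """Return the first `ell` bits of SHA-256(vertex) as an integer
    (assumes most-significant-bit-first ordering of the digest)."""
    if not 1 <= ell <= 256:
        raise ValueError("ell must be between 1 and 256")
    digest = hashlib.sha256(vertex).digest()
    return int.from_bytes(digest, "big") >> (256 - ell)


def wire_count(ell: int) -> int:
    """Number of bits per label after rounding `ell` up to the next full byte."""
    return 8 * ((ell + 7) // 8)


image = truncated_hash(b"user@example.com", 81)
assert image < 2 ** 81
assert wire_count(81) == 88  # 81 bits occupy 11 bytes, i.e. 88 wires
```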

Fig. 11. Evaluation of our prototype implementation. We tested three different label lengths per input size, with lengths corresponding to correctness parameters (which upper-bound the error probability) of $2^{- 40}$ , $2^{- 60}$ , and $2^{- 80}$ . In Fig. 11(a), we present results for all three lengths. In Figs 11(b), 11(c), and 11(d), we present results only for error probability of $2^{- 80}$ .

⁵ We measured the iteration time twice per experiment and took the average as the iteration time for that experiment.

We summarize our experimental results in Figs 11(a), 11(b), 11(c) and 11(d). We evaluated the performance of our generic garbled circuit protocol (including outputting of masked labels), and present the results in Figs 11(a), 11(b), and 11(c). In the generic protocol, encryptions of output labels were computed using AES in CTR mode. Each test was performed with $m = 4$ (maximum component size); Fig. 11(c) contains the time per iteration of each percolate-and-match subcomponent, which can be used to roughly extrapolate to other values of m. All circuit sizes were run 5 times. In Fig. 11(d), we evaluate our El Gamal interface for composing with DDH-style PSI protocols. We benchmarked El-Gamal encryption for only three (smaller) problem sizes.

Importantly, our prototype implementation was not parallelized. We believe performance could be improved substantially by parallelizing, in particular because parts of the garbled circuit and the vast majority of the El-Gamal phase are embarrassingly parallel.

6.3. Fine-grained correctness

We observe that if an application allows us to tune the correctness probability, we can adapt the protocol to allow error with some tolerable probability ε in order to improve efficiency. Specifically, we show that we can set the bit length ℓ of the labels assigned to vertices as a function of ε. Reducing the bit length of the labels directly improves the cost of the protocol, as the number of gates in the garbled circuit is linear in the bit length of the labels.

Recall that correctness of our protocol requires that for every pair of vertices u, v in $G = G_{p_{1}} \cup G_{p_{2}}$ , u and v are assigned the same label if and only if they are in the same component in G. In our protocol, if u and v are in the same component in G, then they are assigned the same label by construction. If they are not in the same component, then they may be assigned the same label only if there is a spurious collision in the images of two unconnected vertices under H. Conditioned on the absence of a spurious collision (which occurs with a tunable, and in our cases very small, probability), the proof of correctness is the same.

Our analysis is an application of the birthday bound. Recall that each party has a graph of size N, and model H as a random oracle. Let $collision$ be the event that any two distinct vertices in G are mapped to the same image under H. By the birthday bound, it follows that $\Pr [collision] ⩽ \frac{{(2 N)}^{2}}{2 \cdot 2^{ℓ}} \qquad (1)$ where N is the number of vertices in each participant’s graph, and ℓ is the output length of H.

Therefore, it is possible to achieve correctness with probability $1 - ε$ by upper-bounding the probability of a spurious collision by ε. To do so, we set $ℓ = ⌈ 2 log (2 N) - log (ε) - 1 ⌉$ . In our experiments, we additionally round ℓ up to the next full byte, guaranteeing that the probability of collision is at most the desired parameter.
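As a worked example of this parameter choice, the Python sketch below computes ℓ for an error bound of the form $ε = 2^{- k}$ using integer arithmetic only, and checks the result against the birthday bound in Eq. (1). The function names and the restriction to power-of-two error bounds are assumptions made to keep the illustration exact.

```python
from fractions import Fraction


def label_bits(n: int, error_exponent: int) -> int:
    """ell = ceil(2*log2(2n) - log2(eps) - 1) for eps = 2**-error_exponent."""
    pairs_bound = (2 * n) ** 2
    # (x - 1).bit_length() equals ceil(log2(x)) for every integer x >= 1.
    ceil_log = (pairs_bound - 1).bit_length()
    return ceil_log + error_exponent - 1


def collision_bound(n: int, ell: int) -> Fraction:
    """Right-hand side of the birthday bound: (2n)^2 / (2 * 2^ell)."""
    return Fraction((2 * n) ** 2, 2 * 2 ** ell)


def round_to_bytes(ell: int) -> int:
    """Round a bit length up to the next full byte."""
    return 8 * ((ell + 7) // 8)


# N = 2^20 vertices per party and eps = 2^-40:
# 2*log2(2N) = 42, so ell = 42 + 40 - 1 = 81 bits, rounded up to 88.
ell = label_bits(2 ** 20, 40)
assert ell == 81
assert collision_bound(2 ** 20, ell) == Fraction(1, 2 ** 40)
assert round_to_bytes(ell) == 88
```

Rounding up to a full byte can only lower the collision bound, so the rounded length still meets the target error.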

7. Discussion and future work

We have presented a two-party protocol that can be used as a setup for subsequent PSI-style computations. Our ID-agreement protocol was designed for use with DDH-style PSI protocols. In particular, we rely on the fact that in DDH-style protocols it is straightforward to work with ElGamal encryptions by taking advantage of the homomorphism over the group operation. We believe similar techniques can be applied to other PSI paradigms, which we leave for future work.

In a real-world application it is possible that the parties will update their respective databases and require new encrypted labels for their modified rows. One approach to computing the updated labels would be to run the entire protocol again, but this would be expensive if the updates occur frequently. More efficiently updating labels without scanning over both parties’ entire inputs is an interesting future direction.

Subsequent to the initial publication of this work, Heath and Kolesnikov [17] introduced further garbled circuit optimizations that we believe would substantially improve the performance of our garbled circuit implementation. To improve the construction further, we consider the fact that more than half of the time spent in our experiments was spent on permuting the list of vertices in our circuit. An anonymous reviewer pointed out that there are more efficient ways to obliviously permute lists, such as the work by Mohassel and Sadeghian [25], and that a technique such as this could reduce the complexity of the permutation step by an order of the security parameter. However, it is not trivial to perform oblivious updates on the label associated with each vertex by composing better shuffling techniques with garbled circuits for black-box updates. We consider this a direction for future research with many potential applications.

Acknowledgments

We would like to thank Samee Zahur for his assistance with the Obliv-C compiler and Jack Doerner for his assistance with Absentminded Crypto Kit.

References

1. Agrawal, Evfimievski and Srikant, Information sharing across private databases, in: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD ’03, ACM, New York, NY, USA, 2003, pp. 86–97. doi:10.1145/872757.872771.

2. Asharov, Komargodski, W.K. Lin, Nayak, Peserico and Shi, Optorama: Optimal oblivious ram, Cryptology ePrint Archive, Report 2018/892, 2018, https://eprint.iacr.org/2018/892.

3. K.E. Batcher, Sorting networks and their applications, in: Proceedings of the April 30–May 2, 1968, Spring Joint Computer Conference, ACM, 1968, pp. 307–314.

4. Beauquier and É. Darrot, On arbitrary size waksman networks and their vulnerability, Parallel Processing Letters 12(03n04) (2002), 287–296. doi:10.1142/S0129626402000999.

5. Bellare, V.T. Hoang and Rogaway, Foundations of garbled circuits, Cryptology ePrint Archive, Report 2012/265, 2012, https://eprint.iacr.org/2012/265.

6. Buddhavarapu, Knox, Mohassel, Sengupta, Taubeneck and Vlaskin, Private matching for compute, Cryptology ePrint Archive, Report 2020/599, 2020, https://eprint.iacr.org/2020/599.

7. Chmielewski and J.H. Hoepman, Fuzzy private matching, in: Availability, Reliability and Security, 2008. ARES 08. Third International Conference on, IEEE, 2008, pp. 327–334. doi:10.1109/ARES.2008.170.

8. Ciampi and Orlandi, Combining private set-intersection with secure two-party computation, in: Security and Cryptography for Networks – 11th International Conference, SCN 2018, Amalfi, Italy, September 5–7, 2018, Catalano and R.D. Prisco, eds, Proceedings. Lecture Notes in Computer Science, Vol. 11035, Springer, 2018, pp. 464–482. doi:10.1007/978-3-319-98113-0_25.

9. Dachman-Soled, Malkin, Raykova and Yung, Efficient robust private set intersection, in: International Conference on Applied Cryptography and Network Security, Springer, 2009, pp. 125–142. doi:10.1007/978-3-642-01957-9_8.

10. De Cristofaro, Kim and Tsudik, Linear-complexity private set intersection protocols secure in malicious model, in: International Conference on the Theory and Application of Cryptology and Information Security, Springer, 2010, pp. 213–231.

11. Doerner, Absentminded crypto kit, 2017.

12. Dong, Chen and Wen, When private set intersection meets big data: An efficient and scalable protocol, in: Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, CCS ’13, ACM, New York, NY, USA, 2013, pp. 789–800. doi:10.1145/2508859.2516701.

13. B.H. Falk, Noble and Ostrovsky, Private set intersection with linear communication from general assumptions, Cryptology ePrint Archive, Report 2018/238, 2018, https://eprint.iacr.org/2018/238.

14. M.J. Freedman, Nissim and Pinkas, Efficient private matching and set intersection, in: EUROCRYPT, Lecture Notes in Computer Science, Vol. 3027, Springer, 2004, pp. 1–19.

15. T.E. Gamal, A public key cryptosystem and a signature scheme based on discrete logarithms, IEEE Trans. Inf. Theory 31(4) (1985), 469–472. doi:10.1109/TIT.1985.1057074.

16. He, Machanavajjhala, Flynn and Srivastava, Composing differential privacy and secure computation: A case study on scaling private record linkage, in: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, ACM, New York, NY, USA, 2017, pp. 1389–1406. doi:10.1145/3133956.3134030.

17. Heath and Kolesnikov, Stacked garbling: Garbled circuit proportional to longest execution path, Cryptology ePrint Archive, Report 2020/973, 2020, https://eprint.iacr.org/2020/973.

18. Huang, Evans and Katz, Private set intersection: Are garbled circuits better than custom protocols? in: 19th Annual Network and Distributed System Security Symposium, NDSS 2012, San Diego, California, USA, February 5–8, 2012, 2012, pp. 5–8, http://www.internetsociety.org/private-set-intersection-are-garbled-circuits-better-custom-protocols.

19. B.A. Huberman, Franklin and Hogg, Enhancing privacy and trust in electronic communities, in: Proceedings of the 1st ACM Conference on Electronic Commerce, ACM, 1999, pp. 78–86. doi:10.1145/336992.337012.

20. Indyk and Motwani, Approximate nearest neighbors: Towards removing the curse of dimensionality, in: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, ACM, 1998, pp. 604–613.

21. Ion, Kreuter, Nergiz, Patel, Saxena, Seth, Shanahan and Yung, Private intersection-sum protocol with applications to attributing aggregate ad conversions, Tech. rep., Cryptology ePrint Archive, Report 2017/738, 2017.

22. Lambæk, Breaking and fixing private set intersection protocols, Tech. rep., Cryptology ePrint Archive, Report 2016/665, 2016, http://eprint.iacr.org/2016/665.

23. K.G. Larsen and J.B. Nielsen, Yes, there is an oblivious ram lower bound! in: Annual International Cryptology Conference, Springer, 2018, pp. 523–542.

24. Lindell and Pinkas, A proof of security of Yao’s protocol for two-party computation, J. Cryptol. 22(2) (2009), 161–188. doi:10.1007/s00145-008-9036-8.

25. Mohassel and Sadeghian, How to hide circuits in mpc: An efficient framework for private function evaluation, Cryptology ePrint Archive, Report 2013/137, 2013, https://eprint.iacr.org/2013/137.

26. Pinkas, Schneider, Segev and Zohner, Phasing: Private set intersection using permutation-based hashing, in: 24th USENIX Security Symposium (USENIX Security 15), USENIX Association, Washington, D.C., 2015, pp. 515–530, https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/pinkas.

27. Pinkas, Schneider, Tkachenko and Yanai, Efficient circuit-based psi with linear communication, Cryptology ePrint Archive, Report 2019/241, 2019, https://eprint.iacr.org/2019/241.

28. Pinkas, Schneider, Weinert and Wieder, Efficient circuit-based psi via cuckoo hashing, Cryptology ePrint Archive, Report 2018/120, 2018, https://eprint.iacr.org/2018/120.

29. Pinkas, Schneider and Zohner, Faster private set intersection based on ot extension, in: Usenix Security, Vol. 14, 2014, pp. 797–812.

30. Rindal and Rosulek, Improved private set intersection against malicious adversaries, Tech. rep., 2016.

31. Segal, Ford and Feigenbaum, Catching bandits and only bandits: Privacy-preserving intersection warrants for lawful surveillance, in: FOCI, 2014.

32. Waksman, A permutation network, Journal of the ACM (JACM) 15(1) (1968), 159–163. doi:10.1145/321439.321449.

33. Wen and Dong, Efficient protocols for private record linkage, in: Proceedings of the 29th Annual ACM Symposium on Applied Computing, ACM, 2014, pp. 1688–1694. doi:10.1145/2554850.2555001.

34. A.C. Yao, Protocols for secure computations, in: Proceedings of the 23rd Annual Symposium on Foundations of Computer Science, SFCS ’82, IEEE Computer Society, Washington, DC, USA, 1982, pp. 160–164. doi:10.1109/SFCS.1982.88.

35. Zahur and Evans, Obliv-c: A language for extensible data-oblivious computation, IACR Cryptology ePrint Archive 2015 (2015), 1153.