Active network alignment: a matching-based approach

4 downloads 45646 Views 529KB Size Report
Oct 18, 2016 - logical networks to social networks to ontologies—where multiple ... the most informative nodes to query based on the top-k ... SI] 18 Oct 2016 ...
Active network alignment: a matching-based approach Eric Malmi∗

Evimaria Terzi†

arXiv:1610.05516v1 [cs.SI] 18 Oct 2016

Abstract Network alignment is the problem of matching the nodes of two graphs, maximizing the similarity of the matched nodes and the edges between them. This problem is encountered in a wide array of applications—from biological networks to social networks to ontologies—where multiple networked data sources need to be integrated. Due to the difficulty of the task, an accurate alignment can rarely be found without human assistance. Thus, it is of great practical importance to develop network alignment algorithms that can optimally leverage experts who are able to provide the correct alignment for a small number of nodes. Yet, only a handful of existing works address this active network alignment setting. The majority of the existing active methods focus on absolute queries (“are nodes a and b the same or not?”), whereas we argue that it is generally easier for a human expert to answer relative queries (“which node in the set {b1 , . . . , bn } is the most similar to node a?”). This paper introduces a novel relative-query strategy, TopMatchings, which can be applied on top of any network alignment method that constructs and solves a bipartite matching problem. Our method identifies the most informative nodes to query based on the top-k matchings of a bipartite graph. We compare the proposed approach to several commonly-used query strategies and perform experiments on both synthetic and real-world datasets. Our matching-based strategy yields the highest overall performance, outperforming all the baseline methods by more than 15 percentage points in some cases. 1

Introduction

The network-alignment problem asks to find a matching between the nodes of two graphs so that both (i) node-attribute similarities and (ii) structural similarities between the matched nodes are maximized. This is an ubiquitous problem with application areas ranging from biological networks [6] to social networks [12, 27], ontologies [24], and image matching in computer vision [7]. For instance, in the case of social networks, one ∗ Aalto

University, [email protected] University, [email protected] ‡ Aalto University, [email protected] † Boston

Aristides Gionis‡

might be interested in aligning the friendship graphs of two social-networking services in order to suggest new friends for the users. Typically, some of the network nodes are easy to align automatically, if they, for example, share a unique name. Other nodes can be considerably more ambiguous and thus fully-automatic methods are likely to align them incorrently. In the active version of the network-alignment problem, such difficult cases are redirected to human experts who act as oracles. In this way, the aligment process is judiciously enhanced by involving humans in the loop. The idea of algorithms that select which data to be labeled in order to improve accuracy is not new; in fact, this is the main focus of active learning. Research in this area aims to identify effective ways for utilizing access to labeling oracles, such as human experts. Although there is a lot of research in active learning that aims to improve the performance of classification or clustering algorithms, few studies have tackled the specific problem of active network alignment. To the best of our knowledge, active networkalignment methods appear mostly in the domain of ontology matching [13, 21, 23]. These methods ask the human experts to assess whether two given nodes are a match or not, and thus, focus on identifying the most useful pair of nodes to query. The limitation of this approach is that absolute yes/no questions can be very hard to answer for a human if no context about alternative candidate matches is provided. In this paper, we obtain human feedback, where questions are asked in the following form: “Given node v and a set of candidate matches C, which node in C is the most likely match for v?”. To answer this type of relative questions, the expert only needs to make comparative (and no absolute) judgments, which are less challenging for humans [16]. Given the above querying scheme, our framework for active network alignment is based on a novel algorithmic idea for identifying the best questions to ask from the experts. Since access to experts is typically costly, the objective is to maximize the alignment accuracy for a given number of queries. Following a well-established principle in active learning, the best data points to label are the ones for

which the current model is least certain as to what the correct output should be. Accordingly, we introduce novel measures for quantifying the uncertainty introduced by each node in the network-alignment process. Using these measures we ask queries that resolve most of the uncertainty, and therefore, only few queries suffice to obtain an alignment of high accuracy. In the classic (non-active) network-alignment problem the input consists of two networks: a source network Gs = (Vs , Es ) and a target network Gt = (Vt , Et ). Finding an alignment is often reduced into the problem of finding a matching on a weighted bipartite graph H = (Vs , Vt , Eh ), where the weights incorporate both attribute and structural similarities. Examples of such methods include Natalie [10, 14], NetAlignMP++ [3], IsoRank [25], and Final [28]. Here, we propose an active alignment method with relative queries, TopMatchings, which can be employed on top of any such matching-based network alignment method. TopMatchings identifies the most uncertain nodes to query based on the set M` of the top-` maximum-weight matchings in H. Given these matchings, TopMatchings picks to query the node v ∈ Vs for which the matchings in M` have the least consensus with respect to the matching node of v. The intuition is that the more diverse the set of matches for v, the more uncertainty v introduces. Our experiments with real and synthetic data show that TopMatchings performs consistently well and in some cases it outperforms our baseline methods by more than 15 percentage points. The baseline methods include three previously proposed query strategies and random querying. Our main contributions are summarized as follows. • We formalize a relative-judgment framework for active network alignment. • We develop an active-querying algorithm, TopMatchings, which can be employed on top of any network-alignment method that finds the alignment using maximum-weight bipartite matching. Indeed, several state-of-the-art network-alignment methods follow this approach. • We conduct experiments with real and synthetic datasets demonstrating the superiority of TopMatchings compared to several previously proposed baseline methods. We also show that TopMatchings can be parallelized without significantly compromising the accuracy of the method. • The code and the data used in the experiments are publicly available at: https://github.com/ekQ/active-network-alignment

Roadmap: The paper is organized as follows. In Section 2 we review the related work. Our problem formulation is provided in Section 3 and our algorithm in Section 4. Section 5 presents the experimental evaluation of our method and comparison with baselines. We conclude in Section 6. 2

Related work

Numerous methods have been developed for the nonactive network-alignment problem. Popular methods include Natalie, which is one of the best performing methods according to several independent studies [3, 6, 10], IsoRank [25] and Graemlin [11] developed in the context of biological-network alignment, Coma++ [1] and Glue [9] developed for ontology matching, and User-Matching [15] developed for social-network matching. The problem of active network alignment has been previously studied by Cort´es and Serratosa [8]. Compared to our work, they focus on a more limited class of network-alignment methods that return a probability matrix for different matches, making the quantification of uncertainty more straightforward. However, some of their query strategies are also applicable to our setting and thus we adopt two baseline strategies from their work, namely lccl and Margin. Another closelyrelated line of work is active ontology matching. Shvaiko and Euzenat [24] list this as one of the important areas for future work in their recent survey on ontology matching. As mentioned in the introduction, relative queries are easier for experts to answer, whereas the existing work on active ontology matching focuses on absolute queries [13, 21, 23]. Active learning approaches have been developed also for other related problems. Charlin et al. [5] develop an active learning method for many-to-many matching problems encountered in recommender systems, comparing different absolute-querying strategies. Another problem where active learning has been studied extensively and where data is in a network format is the inference problem for Gaussian fields [17], [29]. Macskassy [17] finds an empirical risk minimization (erm) to be the best method to find the next instance to query but due to the high computational complexity of erm, he proposes to use the betweenness centrality as a filter to select a subset of nodes for which erm is applied to. Finally, Bilgic et al. [4] introduce an active learning method for collectively classifying networked data. 3

Problem formulation

Before defining the active network alignment problem, we first discuss the non-active version of the problem.

C

C1

C2 B1

B A1

A

B2 A2

A3

indicate the sets of matching nodes. Indicatively, in an application of aligning two social networks, one can think of A = Andres, A1 = Andre, A2 = Andrew, A3 = Andreas, while B = Brendon, B1 = Brenden, B2 = Brendan, etc.

Active network alignment: Assume now that we have access to an oracle, for example, a human expert, with whom we can interact in the form of queries and Figure 1: An illustration of the network-alignment obtain partial information about the correct alignment problem. The task is to align two input networks Gs of the two networks Gs and Gt . As already discussed, we focus on interaction with and Gt . Some nodes and/or edges from either network the oracle that takes the form of the following queries: may be left unmatched. Letters and colors are used to indicate the sets of candidate matching nodes, e.g., Given a node v in the source network G , and a set of s CA = {A1 , A2 , A3 }, CB = {B1 , B2 }, and CC = {C1 , C2 }. candidate matches C in the target network G , which Gs

Gt

v

t

node from Cv should be matched to v? Network alignment: In the standard networkalignment problem we consider two input graphs Gs = (Vs , Es ) and Gt = (Vt , Et ), the source and the target graph. The adjacency matrices of the two graphs are denoted by As and At , respectively. Throughout we assume that the smaller graph is aligned to the larger one, that is, |Vs | ≤ |Vt |. In addition, we consider that a similarity function σ : Vs × Vt → [0, 1] is available, which measures the similarity between pairs of nodes i ∈ Vs and j ∈ Vt . The similarity function σ depends on the application at hand. The objective of the network-alignment problem is to find a matching between the nodes of the two networks. More specifically, we want to align the nodes of the source graph Vs to the nodes of the target graph Vt . Formally, we want to find M = {(i, j)} ⊆ Vs × Vt so that each node of Vs and Vt appears at most one time in M . A high-quality alignment should satisfy the following properties: (i) nodes in Vs should match to similar nodes in Vt , according to σ, and (ii) the endpoints of each edge in Es should match to nodes in Vt that are connected by edges in Et . For each node v ∈ Vs we consider a set of candidate matching nodes Cv ⊆ Vt . Those are the only nodes in Vt that v can be matched to. In this paper we assume that the sets Cv are given. In practice, the sets Cv are computed by a simple heuristic, e.g., considering all nodes in Vt whose feature-based similarity to v ∈ Vs exceeds a certain threshold. When candidate sets are small compared to Vt , the problem is sometimes called sparse network alignment. Global network-alignment methods formulate an objective function that captures the above properties, and then devise algorithms for optimizing this objective. An example of the network-alignment problem is shown in Figure 1, where we use letters and colors to

The oracle returns an answer u ∈ Cv and the algorithm proceeds to aligning Gs with Gt given that v ∈ Vs is matched with u ∈ Vt . The problem of active network alignment is to select the node v ∈ Vs for which to ask to the oracle to reveal the correct matched node u ∈ Vt . A natural way to formulate this problem is to design a function Cert (v | Gs , Gt ) that quantifies the certainty associated with each v ∈ Vs with respect to its match in Vt . Nodes with low certainty scores are good candidates for being the query node. The active-learning framework provides general principles for designing such a function. We deploy these ideas in order to design and experiment our Cert function. More details on this are given in Sections 4 and 5. In the example of Figure 1, we see that A in Gs is most similar to A1 , A2 , and A3 in Gt . If A is matched to A1 or A2 , then B should be uniquely matched to B1 . If A is matched to A3 , then B should be uniquely matched to B2 and C to C2 . Thus, matching node A first, to a large extent, determines the rest of the alignment. On the other hand, matching B or C first does not determine the rest of the alignment with the same level of certainty. We conclude that, in this example, it is a good strategy to ask the oracle to provide us with the correct alignment for node A. 4 Matching-based active network alignment In this section, we present the proposed strategy for active network alignment. This strategy is based on a novel node-query method named TopMatchings. We first introduce the high-level approach and then proceed to describe each of its components in detail. 4.1 Overview. The general active network alignment approach, in which nodes are queried iteratively, can be summarized as follows: In every iteration pick a

Algorithm 1 The general active network alignment framework. Input: Gs = (Vs , Es ), Gt = (Vt , Et ), candidate matches C, and query budget k. Output: An alignment between Vs and Vt . 1: for i = 1, . . . , k do 2: H = NetworkAlignment(Gs , Gt | C) 3: vˆ = argminv∈Vs Cert (v | H, Gs , Gt ) 4: u ˆ = OracleQuery(ˆ v , Cvˆ ), with u ˆ ∈ Cvˆ 5: Cvˆ = {ˆ u} . Update candidate matches. 6: for v ∈ Vs \ {ˆ v } do 7: Cv = Cv \ {ˆ u} 8: H = NetworkAlignment(Gs , Gt | C) 9: return BipartiteMatching(H)

Both methods aim to solve the following quadratic integer program adopting, however, different approaches. Integer-programming formulation: The IP formulation of Natalie and NetAlignMP++ introduces a variable xij for each pair of nodes i ∈ Vs and j ∈ Vt . The variable xij is set to 1 if i is matched to j, while it is set to 0 otherwise. We also use ∆(v), for v ∈ Vs ∪ Vt , to denote the set of all pairs {(i, j)} for which v = i or v = j. X max σ(i, j)xij x

(i,j)∈Vs ×Vt

X

+g

X

A1;ik A2;j` xij xk` ,

(i,j)∈Vs ×Vt (k,`)∈Vs ×Vt

such that

X

xij ≤ 1,

for all v ∈ Vs ∪ Vt ,

(i,j)∈∆(v)

xij ∈ {0, 1}, node vˆ ∈ Vs . The node vˆ together with its set of candidate matching nodes Cvˆ ⊆ Vt is shown to the (human) oracle, and the oracle selects a node u ˆ ∈ Cvˆ as the best matching for vˆ. Assert that u ˆ is the best matching for vˆ by updating the candidate node sets so that Cvˆ = {ˆ u}, and removing u ˆ from all other sets of candidate nodes. When the budget of oracle queries is exhausted, solve the remaining network alignment problem to align the unqueried nodes. Pseudocode for this approach is shown in Algorithm 1. Our main contribution, TopMatchings, is a method for quantifying the certainty of the candidate nodes to be queried at every iteration of Algorithm 1. TopMatchings consists of the following steps. Step 1. Construct a weighted bipartite graph H = (Vs , Vt , Eh ) that forms the basis for the matchingbased network-alignment algorithm. Step 2. Find the set M` of the top-` maximum-weight matchings in H. Step 3. For each node v ∈ Vs , estimate the certainty we have about the correct match for v. These certainty values, Cert(v), are computed using the information in M` . Step 4. Identify the node vˆ ∈ Vs with the least certainty, and query the oracle for selecting the best match of vˆ among the set of candidate matching nodes Cvˆ ⊆ Vt . Each of the above steps are discussed in more detail in the following subsections. 4.2 Constructing the bipartite graph H. Our strategy is applicable to any NetworkAlignment algorithm that is based on solving a bipartite-matching problem. In our experiments, we use Natalie [10] and NetAlignMP++ [3], two state-of-the-art algorithms.

for all (i, j) ∈ Vs × Vt .

In the above integer program, each pair of matched nodes i and j contributes a reward of value σ(i, j), while an additional reward of value g is given if an edge in Es is matched to an edge in Et . The parameter g is userdefined and it quantifies the relative importance between correctly-matched nodes and correctly-matched edges. The inequality constraint ensures that the solution is a matching. Solution via Lagrangian relaxation: Natalie, originally proposed by Klau [14], solves the quadratic integer program by first linearizing it and then employing a Lagrangian relaxation technique. Klau shows that the original integer program is NP-hard, but remarkably, by relaxing a symmetry constraint for the linearized quadratic terms, the problem becomes solvable in polynomial time via multiple maximum-weight bipartite matchings. Natalie iteratively updates the Lagrangian multipliers λ for the relaxed constraints using subgradient optimization. The solutions of the relaxed problem provide upper bounds for the original problem, whereas the feasible solutions, which can be directly extracted from the relaxed solutions, provide lower bounds. The best feasible solution found provides the bipartite graph H for our active strategy. Solution via message passing: The NetAlignMP++ algorithm, proposed by Bayati et al. [3], solves the same optimization problem using a belief propagation (BP) approach. This approach makes local, greedy updates by passing messages between the neighboring nodes. To obtain an integral solution, NetAlignMP++ constructs and solves a maximum-weight matching problem based on the BP messages at every iteration of the algorithm. Again, we set H to correspond to the matching problem that gives the best solution.

4.3 Computing top-` maximum-weight matchings. To compute the ` maximum-weight matchings M` , we use an algorithm invented by Murty [20] in 1968. This algorithm first computes the best maximumweight matching in H, then splits the problem into O(n) smaller matching problems and reconstructs the second best matching based on the solutions of the smaller problems. The process is repeated until ` matchings have been discovered, and in total, the algorithm has a running time of O(`n4 ), where n is the total number of nodes in the graph, i.e., in our case n = max{|Vs |, |Vt |}. Using a modification by Miller et al. [19], the algorithm runs in time O(`n3 ).

vˆ with the least certainty, i.e., (4.2)

vˆ = argminv∈Vs Cert(v).

4.5 Batch querying. Instead of finding the best query to ask at every iteration of Algorithm 1 we can adopt a “batch” approach, where in each iteration we identify a batch of queries to ask—the ones that correspond to the nodes with the smallest Cert values. When considering batches of size k 0 < k, where k is the total number of queries to be made, we can achieve a speedup of the order d kk0 e. We can readily employ all the query strategies studied in this paper to query in batches. However, these 4.4 Quantifying certainty and identifying the strategies might yield nodes that strongly depend on node to query. Using uncertainty to determine which each other (that is, if we knew the alignment for one data points should be labeled is a standard active- node, we could easily align another node). Thus, devellearning strategy [22]. The idea is to select for labeling oping more sophisticated methods for batch querying the data points for which we are the least certain as to remains an interesting avenue for future research. what the correct output should be. To obtain intuition for the proposed method, con- 5 Experimental evaluation sider again the example in Figure 1. The bipartite graph We evaluate a number of different network-alignment H for this example is shown on the left in Figure 2. For query strategies on networks where the correct alignsimplicity we have not drawn the edges of H. ment is known. Our methodology is as follows: First, Now assume that we compute the top-` maximum- we solve the network-alignment problem with a nonweight matchings in this bipartite graph. The top-5 active alignment method and report the alignment acmatchings are shown in Figure 2. For each node v ∈ Vs curacy, that is, the fraction of correctly aligned nodes. we can compute the distribution of the nodes in Vt that Then we start simulating the oracle queries by fixing v is matched in these top-` maximum-weight matchings. the nodes to their correct matches. After each query we A node with high uncertainty is a node that is matched solve the network-alignment problem given the current to many different nodes in Vt and the distribution of correct matches and report the alignment accuracy on matches is well-balanced. In the example of Figure 2 the unqueried nodes. This process is repeated for each the distribution of matchings for nodes A, B, and C query strategy separately with the same initial graphs. is {A1 : 2, A2 : 2, A3 : 1}, {B1 : 4, B2 : 1}, and A good strategy should use as few queries as possible to {C1 : 2, C2 : 3}, respectively. From these distributions reach a certain value of alignment accuracy. one can argue that node A has the highest uncertainty. As mentioned earlier, the query strategies discussed To formalize the idea, let us denote by M` the set of in this paper can be applied on top of any (non-active) top-` maximum-weight matchings in H. For each v ∈ Vs network-alignment method that results in solving a and u ∈ Vt we define fv (u) to be the number of times bipartite-matching problem. In the experiments, we that node v is matched to u in the set of matchings M` . employ the query strategies on top of two state-of-theWe then define the certainty of node v ∈ Vs as art alignment methods, Natalie [10, 14] and NetAlignMP++ [3] (see Section 4.2 for a brief discussion). Next we describe our datasets and present an em(4.1) Cert(v) = max fv (u). u∈Vt pirical comparison between TopMatchings and four baseline query strategies. The intuition for Equation (4.1) is that when a node v is matched to the same node u in most of the matchings in M` , then there is little uncertainty; based on the evidence in M` , u is a good match for v and no extra information will be gained by querying v. On the other hand, if the evidence in M` is inconclusive with respect to the best match for v, then v should be queried. Accordingly, the proposed strategy is to query the node

5.1 Datasets. We use three types of datasets. In all datasets the matching of the nodes between the two graphs is based on attributes, and there is a sufficient degree of ambiguity so that for each node in the one graph there are several candidate matches in the other graph. Also the edges may have been corrupted.

top-5 matchings

Bipartite graph used for matching C B A

C2 C1 B2 B1 A3 A2 A1

C B A

C2 C1 B2 B1 A3 A2 A1

C B A

C2 C1 B2 B1 A3 A2 A1

C B A

C2 C1 B2 B1 A3 A2 A1

C B A

C2 C1 B2 B1 A3 A2 A1

C B A

C2 C1 B2 B1 A3 A2 A1

Figure 2: The instance of the network-alignment problem in Figure 1 is transformed to an instance of a maximumweight matching problem. Computing the set M` of top-` matchings helps to quantify the uncertainty of a node in network alignment.

Preferential-attachment graphs: We generate a network using the preferential-attachment model [2] with 1000 nodes and 2 edges per new node. Then we sample a label for each node from a set of 33 unique labels. The labels are treated as attributes and the similarity between two nodes is set to 1 if their labels are the same and 0 otherwise. Only the nodes with the same label are considered candidate matches, resulting in 30 candidate matches per node on average. We make two copies of the network and in each copy we independently corrupt the edges to further complicate the alignment task. We first discard 60 % of the graph edges chosen at random and then add 50 % more edges again selected at random. Social networks: We use the the multiplex dataset from Aarhus University [18], which contains five networks between the employees of the Department of Computer Science. We select two of these layers, the Facebook and the lunch networks, and try to align the former to the latter. Since the dataset only contains anonymized user identifiers, we again sample a label for each node so that on average there are three people with the same label. The lunch network contains 60 people, whereas the Facebook network covers only a subset of 32 of them. While this dataset is small, it makes an interesting case study since it contains two very different types of real-world networks and the correct alignment between them is known.

64 208 people constructed by an individual genealogical researcher. We sample a subgraph of this network by randomly picking a seed person and doing a random walk until 1000 distinct people have been discovered. For each person we only consider the first name, last name, and birth year. Birth year is corrupted by rounding it to the nearest ten. Some of the first and last names are randomly replaced by their alternative spellings based on a list of common name variations. For example, the name Felix Ahlrooth might get replaced by Feeliks Alroot. Such name variations are frequently encountered in historical documents used in genealogical research. From both graphs independently, we discard 50 % of the edges at random. The task is to align the subgraph into the larger graph. When selecting candidate entities for each individual in the subgraph, we find people from the larger graph born in the same decade and select four of them with the most similar names in addition to the groundtruth match. Name similarity is computed as the average Jaro-Winkler similarity [26] between the first names and between the last names.2 5.2 Baseline query strategies. We compare the performance of the TopMatchings query strategy to four baseline strategies.

Margin: Querying data samples with a small score difference (margin) between the two most probable Family trees: Aligning family trees is an import prob- labels is a common approach in active learning [22]. In lem encountered in various online genealogy services the context of active network alignment, this method where different people provide family tree fragments, has been previously used by Cort´es and Serratosa [8]. Thus, given the bipartite graph H, Margin comwhich then, ideally, get merged into a single large fam1 putes Cert(v) for every v ∈ Vs by first finding the two ily tree. We have obtained a family tree containing edges incident to v with the largest weights w1 (v) and 1 Note

that in the graph-theoretic sense family trees are not trees since they contain cycles such as Mother–FirstChild–Father– SecondChild–Mother.

2 The Jaro-Winkler similarity is a popular choice for deduplicating name records.

w2 (v) in H. Then the node is scored according to

Table 1: Running times for TopMatchings (using an unoptimized version of the Murty’s algorithm with a complexity of O(`n4 )) and for the network alignment Intuitively, the larger the difference between w1 (v) and method Natalie. w2 (v), the less uncertainty there is with respect to the # of nodes Query time (sec) Alignment time (sec) best match for node v. 100 1.2 3.9 lccl: The Least Confident given the Current Labelling 1 000 21.7 11.0 query strategy ranks nodes according to 10 000 9240.5 48.1 Cert(v) = w1 (v) − w2 (v).

Cert(v) = Hv,M (v) , where M is the current matching (the maximum-weight matching in H) and Hv,M (v) is the weight between node v and its current match. This method was reported to yield the highest precision of the four uncertainty sampling based methods studied in [8]. Betweenness: This method queries a previously unqueried node with the highest betweenness centrality in Gs . The method has been previously used in [17]. Random: This method queries nodes from Vs in a random order.

This is probably explained by the fact that Betweenness queries central nodes that have many neighbors and are thus less uncertain since the neighbors help inferring their correct alignment. We conclude that the best overall performance is obtained using the TopMatchings method; lccl and Margin are both competitive baselines but in some cases they yield more than 15 percentage point lower accuracies, whereas TopMatchings performs consistently well.

5.3 Query strategy comparison. Here we perform an empirical comparison of the different query strategies. For each dataset, we run a minimum of 50 random initializations of the input graphs and average the results; for the preferential-attachment graphs, we generate a new pair of input graphs at each initialization, for social networks, we sample new node labelings and for family trees a new subgraph at each initialization. The number of matchings ` in TopMatchings is set to 30. We also tried larger values up to 3000 but this did not seem to have a significant effect on the results. The alignment accuracies are shown in Figures 3, 4, and 5. On the x-axis, we have the number of queried nodes and on the y-axis, the alignment accuracy for the unqueried nodes, that is, the fraction of correctly aligned unqueried nodes. In Figure 3 we see that by querying the 400 best nodes according to TopMatchings or lccl the remaining 600 nodes get correctly aligned when using Natalie, whereas the other query strategies require at least 800 queries for obtaining a perfect alignment. For social networks (Figure 4) Margin clearly outperforms lccl, and TopMatchings outperforms both of them except for when querying 26 nodes or more and using NetAlignMP++. Finally, in family trees (Figure 5), TopMatchings consistently outperforms the other methods when using Natalie, whereas with NetAlignMP++, lccl and Margin yield slightly higher accuracies after querying half of the nodes. A surprising observation is that in most cases Betweenness is outperformed even by random querying.

5.4 Batch querying and scalability. Here we discuss the effect or issuing queries in parallel (as discussed in Section 4.5) in the performance of TopMatchings. More specifically, Figures 3, 4, and 5 show the effect of batch querying for the TopMatchings method using a batch size of 10. In some cases the accuracy is hardly affected (Figure 3) whereas in other cases it drops by 4 percentage points (Figure 4). Even the batch version of TopMatchings yields a better performance than the other methods in most cases. Another potential bottleneck is running TopMatchings to compute the nodes to query. Table 1 lists the running times for computing a query node and for running Natalie when aligning two preferentialattachment graphs with a varying number of nodes. With networks of up to 1 000 nodes, TopMatchings is not yet a significant bottleneck but with 10 000 nodes, issuing TopMatchings takes already 2.6 hours, whereas running Natalie with 300 subgradient optimization steps only takes 48 seconds. Note, however, that these results were obtained using the original version of Murty’s algorithm with a complexity of O(`n4 ) since it was sufficiently fast for the problems studied in this paper and simpler to implement. For larger networks one should employ the optimized version with a complexity of O(`n3 ) [19] or parallelize the naive algorithm which solves n subproblems that are independent of each other. Finally, we also discovered that Natalie can be optimized when employing it multiple times for a sequence of similar problem which is the case when we

Natalie

NetAlignMP++

1

Accuracy (unqueried nodes)

Accuracy (unqueried nodes)

1 0.9 0.8 0.7

TopMatchings LCCL Margin Betweenness Random TopMatchingsBatch

0.6 0.5

0.9 0.8 0.7 0.6 0.5

0

200

400

600

800

1000

0

200

# of queried nodes

400

600

800

1000

# of queried nodes

Figure 3: Alignment accuracy of the unqueried nodes for preferential-attachment graphs. Natalie

NetAlignMP++

1

Accuracy (unqueried nodes)

Accuracy (unqueried nodes)

1 0.95 0.9 0.85 0.8

TopMatchings LCCL Margin Betweenness Random TopMatchingsBatch

0.75 0.7 0.65

0.95 0.9 0.85 0.8 0.75 0.7 0.65

0

5

10

15

20

25

30

0

5

# of queried nodes

10

15

20

25

30

# of queried nodes

Figure 4: Alignment accuracy of the unqueried nodes for social networks. Natalie

NetAlignMP++

1

Accuracy (unqueried nodes)

Accuracy (unqueried nodes)

1

TopMatchings LCCL Margin Betweenness Random TopMatchingsBatch

0.95 0.9 0.85 0.8 0.75

0.95 0.9 0.85 0.8 0.75

0

200

400

600

# of queried nodes

800

1000

0

200

400

600

800

1000

# of queried nodes

Figure 5: Alignment accuracy of the unqueried nodes for genealogical networks. iteratively query a node and fix the alignment of the pliers corresponding to the relaxed constraints by the queried node but otherwise keep the problem unaltered. optimal values of the multipliers in the previous run of This is achieved by initializing the Lagrangian multi- Natalie. In our initial experiments, this strategy pro-

vided speed ups of more than 60 % for the convergence of the subgradient optimization. However, since Natalie is not typically the bottleneck of the proposed active network alignment approach, we have omitted these results. 6

Conclusions

In this paper we formalized an active-learning framework, which allows us to incorporate human feedback into the network alignment problem (also known as graph matching). In this framework, we obtain human feedback by asking relative queries. That is, we query nodes rather than edges, making interaction with experts much easier, whereas most of the existing works on active network alignment query edges. Moreover, we develop TopMatchings, an algorithm that relies on quantifying the alignment uncertainty of a node by exploring the space of matchings in a transformed bipartite graph. In our experimental evaluation, TopMatchings outperforms several previously proposed baseline methods. Finally, we show that TopMatchings can be optimized by leveraging previous alignment solutions and by batch computation. Acknowledgments We would like to thank Pekka Valta and the Genealogical Society of Finland for providing the data for the family-tree experiment. References

[1] D. Aumueller, H.-H. Do, S. Massmann, and E. Rahm. Schema and ontology matching with coma++. In SIGMOD, 2005. [2] A.-L. Barab´ asi and R. Albert. Emergence of scaling in random networks. Science, 286(5439), 1999. [3] M. Bayati, D. F. Gleich, A. Saberi, and Y. Wang. Message-passing algorithms for sparse network alignment. TKDD, 7(1):3, 2013. [4] M. Bilgic, L. Mihalkova, and L. Getoor. Active learning for networked data. In ICML, 2010. [5] L. Charlin, R. Zemel, and C. Boutilier. Active learning for matching problems. In ICML, 2012. [6] C. Clark and J. Kalita. A comparison of algorithms for the pairwise alignment of biological networks. Bioinformatics, 30(16), 2014. [7] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. IJPRAI, 18(3), 2004. [8] X. Cort´es and F. Serratosa. Active-learning query strategies applied to select a graph node given a graph labelling. In Graph-Based Representations in Pattern Recognition. 2013.

[9] A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Ontology matching: A machine learning approach. In Handbook on ontologies. 2004. [10] M. El-Kebir, J. Heringa, and G. W. Klau. Natalie 2.0: Sparse global network alignment as a special case of quadratic assignment. Algorithms, 8(4), 2015. [11] J. Flannick, A. Novak, B. S. Srinivasan, H. H. McAdams, and S. Batzoglou. Graemlin: general and robust alignment of multiple large interaction networks. Genome research, 16(9), 2006. [12] O. Goga, P. Loiseau, R. Sommer, R. Teixeira, and K. P. Gummadi. On the reliability of profile matching across large online social networks. In KDD, 2015. [13] E. Jim´enez-Ruiz, B. C. Grau, Y. Zhou, and I. Horrocks. Large-scale interactive ontology matching: Algorithms and implementation. In ECAI, 2012. [14] G. W. Klau. A new graph-based method for pairwise global network alignment. BMC Bioinformatics, 10(Suppl 1), 2009. [15] N. Korula and S. Lattanzi. An efficient reconciliation algorithm for social networks. In VLDB, volume 7, pages 377–388, 2014. [16] D. Laming. Human judgment: the eye of the beholder. Cengage Learning EMEA, 2003. [17] S. A. Macskassy. Using graph-based metrics with empirical risk minimization to speed up active learning on networked data. In KDD, 2009. [18] M. Magnani, B. Micenkova, and L. Rossi. Combinatorial analysis of multiple networks. arXiv:1303.4986, 2013. [19] M. L. Miller, H. S. Stone, and I. J. Cox. Optimizing murty’s ranked assignment method. IEEE Transactions on Aerospace and Electronic Systems, 33(3), 1997. [20] K. G. Murty. Letter to the editoran algorithm for ranking all the assignments in order of increasing cost. Operations Research, 16(3):682–687, 1968. [21] C. Sarasua, E. Simperl, and N. F. Noy. Crowdmap: Crowdsourcing ontology alignment with microtasks. In ISWC, 2012. [22] B. Settles. Active learning literature survey. Technical report, University of Wisconsin–Madison, 2010. [23] F. Shi, J. Li, J. Tang, G. Xie, and H. Li. Actively learning ontology matching via user interaction. In ISWC, 2009. [24] P. Shvaiko and J. Euzenat. Ontology matching: state of the art and future challenges. TKDE, 25(1), 2013. [25] R. Singh, J. Xu, and B. Berger. Global alignment of multiple protein interaction networks with application to functional orthology detection. PNAS, 105(35), 2008. [26] W. E. Winkler. String comparator metrics and enhanced decision rules in the fellegi–sunter model of record linkage. In Proceedings of the Section on Survey Research Methods, 1990. [27] J. Zhang and P. S. Yu. Multiple anonymized social networks alignment. In ICDM, 2015. [28] S. Zhang and H. Tong. Final: Fast attributed network

alignment. In KDD, 2016. [29] X. Zhu, J. Lafferty, and Z. Ghahramani. Combining active learning and semi-supervised learning using gaus-

sian fields and harmonic functions. In ICML Workshop, 2003.