An approach to quantify integration quality using feedback on mapping results

Fernando R.S. Serrano (University of Manchester, Manchester, UK)
Alvaro A.A. Fernandes (University of Manchester, Manchester, UK)
Klitos Christodoulou (Neapolis University, Pafos, Cyprus)

International Journal of Web Information Systems

ISSN: 1744-0084

Article publication date: 31 December 2018

Issue publication date: 7 March 2019

Abstract

Purpose

The pay-as-you-go approach to data integration aims to reduce the time and effort required by proposing a bootstrap phase in which algorithms, rather than experts, identify semantic correspondences and generate the mappings. This highly automated bootstrap phase is likely to be of low quality, thus pay-as-you-go approaches postulate a subsequent continuous improvement phase based on user feedback assimilation to improve the quality of the integration. The purpose of this paper is to quantify the quality of a speculative integration, using one particular type of feedback, mapping results, whilst taking into account the uncertainty of user feedback provided.

Design/methodology/approach

The authors propose a systematic approach to quantify the quality of an integration as a conditional probability given the trustworthiness of the workers. Given a set of mappings and a set of workers of unknown trustworthiness, feedback instances are collected in the extents of the mappings that characterize the integration. Taking into account the available evidence obtained from worker feedback, the technique provides a quality quantification of the speculative integration.

Findings

Experimental results on both synthetic and real-world scenarios provide valuable empirical evidence that the technique produces a cost-effective quantification of integration quality that faithfully reflects the judgement of the workers whilst taking into account the inherent uncertainty of user feedback.

Originality/value

Current pay-as-you-go techniques provide a limited view of the integration quality as the result of feedback assimilation. To the best of the authors’ knowledge, this is the first proposal for quantifying integration quality in a systematic and principled manner using mapping results as a piece of evidence while at the same time considering the uncertainty inherited from user feedback.

Citation

Serrano, F.R.S., Fernandes, A.A.A. and Christodoulou, K. (2019), "An approach to quantify integration quality using feedback on mapping results", International Journal of Web Information Systems, Vol. 15 No. 1, pp. 47-70. https://doi.org/10.1108/IJWIS-05-2018-0043

Publisher

Emerald Publishing Limited

Copyright © 2018, Emerald Publishing Limited


1. Introduction

Recent trends, such as the evolution of the traditional Web of Documents into a Web of Data (Heath and Bizer, 2011), and the continued emergence of so-called big data have led to an unprecedented number of data sources becoming available for potential use by individuals and organizations. To make the most of this, given that such sources are inherently semantically heterogeneous, a range of techniques are needed to tackle the problem of integrating them, i.e. allowing access to them as if they had been designed with a semantically uniform purpose. Indeed, many years of research activity have delivered many effective techniques for the most important steps in the data integration process (see Doan et al., 2012, for a comprehensive account).

However, the fact that data sources are becoming not only more numerous, but more diverse and more dynamic poses significant challenges to traditional data integration techniques. Setting up and maintaining an integrated resource over highly heterogeneous, distributed data sources that lack explicit, stable schemas and are highly dynamic (e.g. Web sources) is a difficult, effort- and time-intensive task (Madhavan et al., 2007; Paton et al., 2012).

Over time, the realization grew among researchers that, for the kinds of data sources that are most often available and have the most interest, traditional data integration was not cost-effective, raising a problem that is likely to become more acute with time (Paton et al., 2016). In response, a new approach has emerged that centers on the notion of a dataspace (Halevy et al., 2006). Dataspace management systems (Hedeler et al., 2012) consist of tools and techniques for automating the initial set-up of integrated resources that are subsequently continuously improved through user feedback. The approach is often referred to as pay-as-you-go data integration because the ethos is that, while an initial, automated integration may not be high-quality, it is produced at low cost and quickly, thereby allowing it to be used more rapidly than would otherwise be the case. This enables and encourages users to provide feedback and reap the benefits of such investment in the form of a continuously improving integration.

The pay-as-you-go approach to data integration is now widely accepted as suitable to apply in integration scenarios consisting of highly heterogeneous, distributed data sources that are potentially schema-unstable and volatile, of which many big data sources are exemplars. Pay-as-you-go data integration enables on-demand integration with reduced up-front costs (Hedeler et al., 2009). In brief, this is made possible because in this approach one engages in a bootstrapping phase relying heavily on automated processes to perform such tasks as data extraction (Furche et al., 2014), to gain access to syntactically uniform data; matching, to identify semantic correspondences (Aumueller et al., 2005); and mapping generation (Marnette et al., 2011), to enable access to sources via an integrated, global view. In this paper, we refer to the outcome of bootstrapping as an integration, i.e. a set of mappings that allow one to query an integrated resource (referred to as the target) over a collection of heterogeneous, autonomous, distributed data sources. As the integration resulting from this highly automated bootstrapping phase is likely to be of less than ideal quality (Sarma et al., 2008), pay-as-you-go approaches postulate a subsequent continuous improvement phase based on user feedback assimilation (Belhajjame et al., 2011; Jeffery et al., 2008).

This process of feedback assimilation still poses several challenges:

  • it could be costly if it requires too many feedback instances from too many users;

  • it could be slow to deliver quality if, as a result of not making the most of currently collected feedback, users are often required to provide more feedback instances; and

  • it could be hindered by the challenge of quantifying integration quality in a principled, cost-effective manner; it is on this last challenge that we focus in this paper.

Current pay-as-you-go techniques only provide a limited view of integration quality as the result of feedback assimilation. As we show in detail in Section 4, prior proposals have come up with formalisms for modeling uncertainty but with a narrow focus, e.g. in respect of matching feedback only. This paper addresses the specific challenge of quantifying integration quality in a principled, cost-effective manner and, in particular, contributes:

  • a cost-effective, sampling-based proposal for categorizing individual mappings with respect to their quality on the basis of feedback on mapping results and the trustworthiness of feedback providers;

  • a detailed proposal for deriving from such a categorization a quantification of the quality of an integration, seen as a set of mappings, in the form of a conditional probability; and

  • an empirical evaluation on both synthetic and real-world integration scenarios of the cost-effectiveness of our quantification technique.

Briefly, the proposed technique uses the testimony of users on whether individual tuples that actually appear in the result should or should not do so. The technique deals with the absence of a ground truth by taking it to be that revealed by majority voting, which also allows us to characterize the trustworthiness of feedback providers.

The remainder of the paper is structured as follows. Section 2 is a detailed description of our main contribution, with a running example being used to illustrate each step in the construction. Section 3 describes the results of the empirical evaluation of our contribution. Section 4 discusses related work. Section 5 summarizes our conclusions.

2. Measuring integration quality

2.1 Problem definition

An integration is a set of mappings. In the pay-as-you-go approach, these are initially derived using automated techniques. Queries against the consequent integrated resource are answered using the mappings (e.g. as views) over the sources that supply the actual data. The problem we tackle in this paper is that of assigning a measure of quality to such an integration. More formally, we define this problem as follows:

Given an integration 𝕄 as a set of mappings and a set W of workers of unknown trustworthiness, collect, from workers in W, instances of feedback on tuples in the extents of the mappings in 𝕄 and, by construing positive (resp., negative) feedback as positive (resp., negative) evidence of the correctness of the corresponding mapping, assign to 𝕄 a quality measure in correspondence with the judgments of the workers in W.

2.2 Approach to solution

Our solution can be broadly described as follows. Firstly, we take the collection of mappings (possibly, a sample thereof) in the integration and we sample the tuples in the extents of the latter. We then collect feedback from workers on whether a tuple is expected (i.e. whether they concur that the tuple belongs in the result). In doing so, we use replication so as to contend with variable worker-trustworthiness. Thus, feedback on a given tuple is asked of k workers (where k is the replication factor) and the majority vote is taken as the ground truth as to whether that tuple is indeed expected. Then, using such ground truths, we build a confusion matrix that enables us to derive quality indicators over the sampled extent of each mapping in the integration. We then use such indicators to derive event counts (the number of times the proposition that a given mapping is good holds, and the number of times the proposition that workers are trustworthy in their judgments holds). Finally, such event counts allow us to derive the conditional probability that the integration is good[1] given the evidence obtained from worker feedback. We now explain these steps in detail (as depicted in Figure 1), accompanying them with a running example to facilitate understanding.

Sampling mapping extents: Given a collection of mappings M ⊆ 𝕄, for each mapping m ∈ M, we evaluate m and select a sample ⟦m⟧ of tuples in its extent.

Collecting feedback using replication: Given a set W of workers and a degree of replication k > 1, k mod 2 = 1, we collect k feedback instances on each tuple t ∈ ⟦m⟧ from k different workers in W. A feedback instance is either positive, denoted by "+", or negative, denoted by "−". Positive (resp., negative) feedback on a tuple is construed as a judgment by the corresponding worker that t belongs, or is expected to occur (resp., does not belong, or is not expected to occur), in the extent of m.

Consider the example mapping m1 ∈ M in Table I, where the leftmost column shows a sample of three tuples from its extent, ⟦m⟧ = {t1, t2, t3}. The next three columns to the right show that, in this example, k = 3, and, therefore, each tuple was sent to three different workers for feedback. The table shows the feedback sequences that might have been returned for each tuple (e.g. [+, −, +] for t2). We now explain what the remaining columns denote and how they are derived.

Deriving majority-based ground truths: Given a k-long feedback sequence S(t), drawn from the domain {+, −}, for each tuple t ∈ ⟦m⟧, we define the majority vote for a tuple t as follows: MV(t) = + if the count of + in S(t) is larger than the count of − in S(t), and MV(t) = − otherwise. Henceforth, we construe MV(t) as the ground truth judgment on t.

Consider again the example in Table I. The MV column reflects the definitions above (e.g. t2 has two positive and one negative vote, so MV(t2) = +, whereas t3 has one positive and two negative votes, so MV(t3) = −).
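To make the construction concrete, here is a minimal Python sketch (ours, not the authors' implementation; all names are illustrative) that derives the majority-vote ground truths for the Table I example:

from collections import Counter

def majority_vote(feedback):
    # Return '+' if positive votes outnumber negative ones, '-' otherwise.
    counts = Counter(feedback)
    return '+' if counts['+'] > counts['-'] else '-'

# Feedback sequences for the sampled extent of m1 (k = 3 workers per tuple).
feedback_m1 = {
    't1': ['+', '+', '+'],
    't2': ['+', '-', '+'],
    't3': ['+', '-', '-'],
}

ground_truth = {t: majority_vote(votes) for t, votes in feedback_m1.items()}
# ground_truth == {'t1': '+', 't2': '+', 't3': '-'}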

Deriving confusion matrices from ground truths: We use the derived ground truth for each tuple in the sampled extent ⟦m⟧ to derive a confusion matrix CM(m) over the sampled extent of every mapping in M.

Recall that, in a Boolean setting such as ours, a confusion matrix is a 2 × 2 matrix where the four cells denote the observed counts of true and false positives, and of true and false negatives, defined as follows. We denote by TP (for true positives) the count of elements in S(t) that have the value + when MV(t) = +. Analogously, we denote by FN (for false negatives) the count of elements in S(t) that have the value − when MV(t) = +, by TN (for true negatives) the count of elements in S(t) that have the value − when MV(t) = −, and by FP (for false positives) the count of elements in S(t) that have the value + when MV(t) = −. Note that, in Table I, the 2 × 2 confusion matrix is linearized into four columns corresponding to each one of the four cells (as shown in Table II).

Consider again the example in Table I. The TP, FN, TN and FP columns reflect the definitions above (e.g. for t2, the ground truth is positive (MV(t2) = +) and there are two positive elements and one negative element, hence, for t2, TP = 2 and FN = 1; for t3, the ground truth is negative (MV(t3) = −) and there are two negative elements and one positive element, hence, for t3, TN = 2 and FP = 1). We then sum the counts obtained for each feedback sequence S(t) into its corresponding aggregate for m. These aggregates are shown in the last line in Table I. We now explain how we use them.
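Continuing the sketch (again ours and purely illustrative), the per-mapping confusion matrix CM(m) can be aggregated from the feedback sequences and their majority-vote labels as follows:

def confusion_matrix(feedback_by_tuple, ground_truth):
    # Aggregate TP, FN, TN and FP over all feedback instances for one mapping.
    cm = {'TP': 0, 'FN': 0, 'TN': 0, 'FP': 0}
    for t, votes in feedback_by_tuple.items():
        for v in votes:
            if ground_truth[t] == '+':
                cm['TP' if v == '+' else 'FN'] += 1
            else:
                cm['TN' if v == '-' else 'FP'] += 1
    return cm

# With feedback_m1 and ground_truth from the previous sketch, this returns
# {'TP': 5, 'FN': 1, 'TN': 2, 'FP': 1}, i.e. the last row of Table I.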

Deriving metrics from CM counts: We view the workers that are asked to give feedback as, collectively, exhibiting the behavior of a Boolean classification model, in the sense that each worker makes independent predictions on the true label for a set of tuples {t1, …, tn} ⊆ ⟦m⟧. As such, a confusion matrix aggregates the number of correct/incorrect predictions in relation to the majority vote (i.e. the ground truth). In this context, a confusion matrix allows us to compute measures that reflect the overall performance of the workers that provide feedback on the tuples in a mapping extent. We adopt the same notion of a confusion matrix as the one presented in Fawcett (2006). These measures can be used to assess the trustworthiness of the workers that are providing feedback on those tuples. We denote by Wm the set of workers {w1, …, wn} ⊆ W that provide feedback on tuples in ⟦m⟧.

In particular, we use the following measures:

  • the positive predictive value, a.k.a. precision, is computed as:

    PPV(Wm)=TP(Wm)/(TP(Wm)+FP(Wm))

and denotes the fraction of tuples annotated as positives that are actually true, and corresponds, in our setting, to an answer to the following question:

Q1.

When workers predict tuples as correct, how often are they actually correct?[2]

  • the true positive rate, a.k.a. recall, is computed as:

    TPR(Wm)=TP(Wm)/(TP(Wm)+FN(Wm))

and denotes the fraction of tuples whose ground truth is positive that are correctly identified as such, and corresponds, in our setting, to an answer to the following question:

Q2.

When tuples are correct, how often do workers predict them as correct?

  • the negative predictive value is computed as:

    NPV(Wm)=TN(Wm)/(FN(Wm)+TN(Wm))

and denotes the fraction of tuples annotated as negatives that are actually false, and corresponds, in our setting, to an answer to the following question:

Q3.

When workers predict tuples as incorrect, how often are they actually incorrect?

  • the true negative rate is computed as:

    TNR(Wm)=TN(Wm)/(FP(Wm)+TN(Wm))

and denotes the fraction of tuples whose ground truth is negative that are correctly identified as such, and corresponds, in our setting, to an answer to the following question:

Q4.

When tuples are incorrect, how often do workers predict them as incorrect?

  • the harmonic mean of PPV and TPR is computed as:

    f+(Wm) = 2 × ((PPV(Wm) × TPR(Wm))/(PPV(Wm) + TPR(Wm)))

and is referred to as the positive f-measure, with the intuition that the higher the f+, the larger the proportion of workers that are mostly trustworthy at identifying most of the correct tuples; and

  • the harmonic mean of NPV and TNR is computed as:

    f−(Wm) = 2 × ((NPV(Wm) × TNR(Wm))/(NPV(Wm) + TNR(Wm)))

and is referred to as the negative f-measure, with the intuition that the higher the f−, the larger the proportion of workers that are mostly trustworthy at identifying most of the incorrect tuples.

Note that perfect workers would provide feedback for a mapping m such that f+ = f− = 1.

From TP = 5, FN = 1, TN = 2 and FP = 1 in Table I, we compute PPV(Wm) = 5/6 = 0.83, TPR(Wm) = 5/6 = 0.83, NPV(Wm) = 2/3 = 0.66, TNR(Wm) = 2/3 = 0.66, f+(Wm) = 2 × (0.83 × 0.83)/(0.83 + 0.83) = 0.83 and f−(Wm) = 2 × (0.66 × 0.66)/(0.66 + 0.66) = 0.66.
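The measures above can be sketched in Python as follows (ours; in particular, mapping division by zero to 0.0, which arises for a mapping such as m5 with no positive tuples, is an assumption on our part):

def cm_measures(cm):
    def ratio(num, den):
        return num / den if den else 0.0          # assumed behaviour for empty denominators
    ppv = ratio(cm['TP'], cm['TP'] + cm['FP'])    # precision (Q1)
    tpr = ratio(cm['TP'], cm['TP'] + cm['FN'])    # recall (Q2)
    npv = ratio(cm['TN'], cm['FN'] + cm['TN'])    # negative predictive value (Q3)
    tnr = ratio(cm['TN'], cm['FP'] + cm['TN'])    # true negative rate (Q4)
    f_pos = ratio(2 * ppv * tpr, ppv + tpr)       # positive f-measure
    f_neg = ratio(2 * npv * tnr, npv + tnr)       # negative f-measure
    return {'PPV': ppv, 'TPR': tpr, 'NPV': npv, 'TNR': tnr, 'f+': f_pos, 'f-': f_neg}

# cm_measures({'TP': 5, 'FN': 1, 'TN': 2, 'FP': 1}) yields PPV = TPR = 0.83 and
# NPV = TNR = 0.66 (to two decimals), hence f+ = 0.83 and f- = 0.66, as above.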

Deriving an event count matrix: The CM-derived measures just described are then used for deriving an event count matrix ECM (𝕄) for the integration.

In building the ECM, we consider co-occurrences of events. Some relate to mapping quality, the others relate to the trustworthiness of the workers, as follows. The first (resp., second) event type is of the form "m is good" is true (resp., false), which we denote by H (resp., H̄). The third (resp., fourth) event type is of the form "workers are trustworthy" is true (resp., false), which we denote by ε (resp., ε̄). As there are four event types and we are interested in their co-occurrence, the ECM is a 2 × 2 matrix, whose structure (including per-column and per-row summations) is shown in Table III.

Firstly, note that the evidence accumulated in the confusion matrix CM(m) for a mapping m results in the increment, by one, of one, and only one, cell of ECM(M). Secondly, note that, in this paper, we do not quantify the quality of individual mappings. Instead, for each mapping m we compute the ECM cell that is to be incremented given its confusion matrix CM(m), as such cell denotes a true/false conjunctive statement as to the quality of m and the trustworthiness of the workers that provided feedback on tuples in its extent.

We now describe the event definitions that allow the counts in the cells in an ECM to be computed. Then, we show how, from such counts, we can quantify integration quality as a probability conditioned on the feedback provided.

The proposition H(m) ≡ "m is good" is true (equivalently, holds) when the count of true tuples in CM(m) (w.r.t. ground truth) is larger than or equal to the count of false tuples (w.r.t. ground truth), and H(m) ≡ "m is good" is false otherwise. More formally[3]:

(1) H(m) = T if (TP + FN) ≥ (FP + TN), and H(m) = F otherwise

Then, w.r.t. the cells in Table III, and given equation (1), we compute the following sets:

(2) H = {m | H(m) = T, m ∈ M}
(3) H̄ = {m | H(m) = F, m ∈ M}

A trustworthy worker is one that correctly judges the tuples given for feedback, i.e. the worker judges a tuple as positive when it is correct (with respect to the ground truth), and negative when it is incorrect (with respect to the ground truth). Here, we differentiate this ability in respect of when the mapping is good and when the mapping is not good. The intuition is that a worker may behave differently when judging positive tuples from when judging negative ones. For example, a worker may be better at judging positive than negative tuples.

Informally, workers are trustworthy on good mappings if they have high performance, construed as an above-average f+, over good mappings, i.e. in mostly judging true tuples as positive (i.e. high PPV) and in judging most true tuples as positive (i.e. high TPR). Informally, workers are trustworthy on bad mappings if they have high performance, construed as an above-average f−, over bad mappings, i.e. in mostly annotating false tuples as negative (i.e. high NPV) and in annotating most false tuples as negative (i.e. high TNR). Correspondingly, workers are not trustworthy on good (resp., bad) mappings if they do not have high performance, construed as an average or below-average f+ (resp., f−), over good (resp., bad) mappings.

More formally, given equations (2) and (3), for the purposes of populating an ECM with the structure depicted in Table III, to count whether propositions on worker trustworthiness are true (equivalently, hold) we first compute the following sets:

(4) Em> = {m | f+(Wm) > avg({f+(Wm′) | m′ ∈ H})}
(5) Ēm> = {m | f−(Wm) > avg({f−(Wm′) | m′ ∈ H̄})}
(6) Em< = {m | f+(Wm) ≤ avg({f+(Wm′) | m′ ∈ H})}
(7) Ēm< = {m | f−(Wm) ≤ avg({f−(Wm′) | m′ ∈ H̄})}

Now, given equations (4)-(7), we compute the following sets:

(8) ε = Em> ∪ Ēm>
(9) ε̄ = Em< ∪ Ēm<

All the cells in the ECM for an integration can be computed from the cardinalities of the sets in equations (2)-(3), (8) and (9) or their intersections, as defined in Table III. For example, for a mapping m, one would add one to the cell (H ∩ ε) if the mapping m is good and the workers providing feedback for mapping m are trustworthy. For another example, one would add one to the cell (H ∩ ε̄) if the mapping m is good and the workers providing feedback for mapping m are not trustworthy.

To better show how an ECM is derived, we add to the mapping m1 in Table I, four other mappings, m2 to m5. We show in Table IV the feedback collected, the majority votes and the CM cell values for each of the five mappings. Then, in Table V, we show the CM-derived measures for the example mappings.

Having (a) collected feedback on each mapping, (b) computed the corresponding CMs, and (c) derived the relevant CM measures, we can emit the ECM cell for each mapping.

First, we need to discriminate each mapping as good (H) or not (H̄) using equation (1). Table VI shows the outcome for our running example. From this table, we have that H = {m1, m2, m3} and H̄ = {m4, m5}.

Second, given Tables V and VI, we can determine whether workers are trustworthy or not for good and for bad mappings using equations (4)-(7). To do so, we compute the average f+ for good mappings (those in H) and the average f− for bad mappings (those in H̄). For the running example, we have:

avg[f+(m1), f+(m2), f+(m3)] = avg[0.83, 0.72, 0.82] = 0.78

and

avg[f−(m4), f−(m5)] = avg[0.72, 0.87] = 0.79

Given the average f+ and the average f−, when determining whether workers were trustworthy, we use the former for good mappings and the latter for bad mappings. For the running example, this is shown in Table VII[4]. Thus, m1, which is a good mapping, has an f+ of 0.83, larger than the average f+ for good mappings of 0.78, and we, therefore, conclude that the workers were trustworthy in providing feedback for m1. Correspondingly, the workers were not trustworthy in providing feedback for m2, also a good mapping. Note also that m4, a bad mapping, has an f− of 0.72, smaller than the average f− for bad mappings of 0.79, and therefore the workers were not trustworthy in providing feedback for m4. Correspondingly, the workers were trustworthy in providing feedback for m5, also a bad mapping.

Given Tables VI and VII (i.e. having determined which mappings are good and which are not, and having also determined whether workers were trustworthy when providing feedback for good and for bad mappings), we compute the ECM cell that each mapping increments by observing its membership in the intersection of the relevant paired events, as shown in Table VIII.

Finally, we obtain the relevant counts to populate ECM(M) as in Table IX.
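The following sketch (ours, with assumed data structures; 'e' and '~e' stand for ε and ε̄, 'H' and '~H' for H and H̄) puts equations (1)-(9) together: it classifies each mapping as good or bad, decides worker trustworthiness against the relevant average f-measure and increments the corresponding ECM cell, reproducing Table IX from the counts of Table IV and the measures of Table V:

def is_good(cm):
    # Equation (1): at least as many true as false tuples (w.r.t. the majority-vote ground truth).
    return cm['TP'] + cm['FN'] >= cm['FP'] + cm['TN']

def build_ecm(cms, measures):
    # cms and measures: dicts keyed by mapping id, holding CM counts and CM-derived measures.
    good = [m for m in cms if is_good(cms[m])]
    bad = [m for m in cms if not is_good(cms[m])]
    avg_f_pos = sum(measures[m]['f+'] for m in good) / len(good) if good else 0.0
    avg_f_neg = sum(measures[m]['f-'] for m in bad) / len(bad) if bad else 0.0
    ecm = {('H', 'e'): 0, ('H', '~e'): 0, ('~H', 'e'): 0, ('~H', '~e'): 0}
    for m in cms:
        if m in good:
            trusted = measures[m]['f+'] > avg_f_pos     # equations (4) and (6)
            ecm[('H', 'e' if trusted else '~e')] += 1
        else:
            trusted = measures[m]['f-'] > avg_f_neg     # equations (5) and (7)
            ecm[('~H', 'e' if trusted else '~e')] += 1
    return ecm

# On the running example, build_ecm returns
# {('H', 'e'): 2, ('H', '~e'): 1, ('~H', 'e'): 1, ('~H', '~e'): 1}, i.e. Table IX.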

Quantifying Integration Quality. Given ECM(M), we quantify the quality of the integration 𝕄 = {m1, …, m5}, given the feedback provided on the extents of the mappings that comprise it, using the following form of Bayes’ theorem:

(10) P(H|ε) = P(ε|H)P(H)/P(ε)

where P(H) is the probability that the integration 𝕄 is good, P(ε) is the probability that the workers are trustworthy, P(ε|H) is the likelihood that the workers are trustworthy given that the integration is good, and P(H|ε) is the probability that the integration is good given that the workers are trustworthy.

To conclude our running example, given Table IX and equation (10), we quantify the integration quality as a conditional probability, as follows:

P(H|ε) = (0.60 × 0.66)/0.60 = 0.66
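A direct rendering of this calculation from the ECM counts (ours; it reuses the dictionary layout of the previous sketch) is:

def integration_quality(ecm):
    n = sum(ecm.values())                            # one count per mapping
    n_good = ecm[('H', 'e')] + ecm[('H', '~e')]      # |H|
    n_trusted = ecm[('H', 'e')] + ecm[('~H', 'e')]   # |ε|
    p_h = n_good / n                                 # P(H)
    p_e = n_trusted / n                              # P(ε)
    p_e_given_h = ecm[('H', 'e')] / n_good           # P(ε | H)
    return p_e_given_h * p_h / p_e                   # equation (10): P(H | ε)

ecm_example = {('H', 'e'): 2, ('H', '~e'): 1, ('~H', 'e'): 1, ('~H', '~e'): 1}
# integration_quality(ecm_example) evaluates to 0.666..., the 0.66 of the running example.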

This completes the description of our main contribution. In the above, we have not addressed the issue that, in realistic pay-as-you-go integration scenarios, feedback collection is limited by a budget and it is desirable to have the ability to collect feedback incrementally. We now show how our approach lends itself to estimating the quality of an integration in an incremental fashion.

Incremental Feedback Assimilation. The technique contributed in this paper, as described above, is inherently incremental in the sense that one can, in broad terms, (a) take a sample of the mappings and then a sample of the tuples in the extents of those mappings, (b) collect feedback on such tuples, (c) quantify the quality of the integration and (d) decide, budget permitting, whether to increase one or both of the samples and perform (b)-(d) iteratively, incrementing the sample size at each iteration.

For simplicity, in the running example, rather than taking samples, we have worked with all the mappings and all the tuples from each mapping extent. For larger integrations and larger mapping extents, one might prefer to proceed incrementally. This is reflected in the following modified problem:

Given an integration 𝕄 as a set of mappings and a set W of workers of unknown trustworthiness, collect, from workers in W, in steps of size tM (for the mappings) and ts (for the tuples), up to b instances of feedback on tuples in the extents of the mappings in 𝕄 and, by construing positive (resp., negative) feedback as positive (resp., negative) evidence of the correctness of the corresponding mapping, assign to 𝕄 a quality measure in correspondence with the judgments of the workers in W.

Incrementally assimilating a larger or smaller number of feedback instances potentially changes the per-mapping CMs and, consequently, the per-integration ECM. For example, a larger number of positive (or negative) feedback instances could change the majority vote for a tuple t ∈ ⟦m⟧, which, in turn, could change the number of true positives and false negatives in CM(m), perhaps giving rise to a different classification for the mapping m. This would change the ECM and hence the quantification of integration quality. As expected, the more feedback is collected, the more accurate the quantification is. In this modified problem, at each iteration, the sample of tuples grows by the given step size ts until the given budget b is reached or else the quantification is considered accurate enough (e.g. it has stabilized at a certain value) as judged by some metric (e.g. variance).

Algorithm 1 illustrates the general process of assimilating feedback on mapping results for the purposes of quantifying integration quality. The algorithm takes as input: a set of mappings from an integration 𝕄, a set of workers W, a mapping-sample increment tM indicating the percentage of mappings to annotate on each iteration, a tuple-sample increment ts indicating the percentage of tuples to annotate on each iteration, a budget b indicating the maximum number of feedback instances to collect (alternatively, one might map this to a monetary amount), and a replication factor k. The output of the algorithm is the estimated integration quality as a conditional probability.

The algorithm iterates until the budget b has been spent (line 1) or there are no unannotated tuples (lines 2-5). The main loop of the algorithm starts by taking the set of mappings that have unannotated tuples in their extents and selecting tM mappings from this set to yield the sample of mappings for which feedback is to be collected in this iteration (lines 2-3). Then, for each mapping, we take a sample of size ts from its unannotated extent (lines 8-9). For each tuple t, we collect k instances of feedback from the set of workers W, update the confusion matrix for the mapping m to which t belongs and, from the latter, compute the cell in the event count matrix ECM(M) for the integration that m increments (lines 11-15), whilst making sure that we adjust downwards the set of tuples available for feedback collection on the extent of m (line 12) and the budget b (line 19). We can emit the integration quality given the feedback assimilated so far at each iteration (lines 2-20) either after a tuple-sample step (after line 17) or, more laconically, after a mapping-sample step (line 20), of which there may be only one, if we take all the mappings from the start.

Algorithm 1 Incremental Feedback Assimilation

Input: set of mappings 𝕄

Input: set of workers W

Input: mapping increment tM

Input: tuple increment ts

Input: budget b

Input: replication factor k

Output: integration quality (as a probability value)

1 while b > k × ts × tM do

2  M ← {m ∈ 𝕄 | m has unannotated tuples}

3  M ← select(tM, M)

4  if M = Ø then

5   break

6  else

7   for m ∈ M do

8    T ← {t ∈ ⟦m⟧ | t is unannotated}

9    T ← select(ts, T)

10    for t ∈ T do

11     F ← collectFeedback(k, t, W)

12     mark t as annotated

13     update(CM(m), F)

14     update(ECM(M), CM(m))

15     derive(P(H | ε), ECM(M))

16    end

17   end // if needed, emit P(H | ε) here

18  end

19  b ← b − (k × ts × tM)

20  emit P(H | ε)

21 end
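As a complement to the pseudocode, the following is a compact Python sketch of Algorithm 1 (ours, not the authors' implementation). It reuses majority_vote, cm_measures, build_ecm and integration_quality from the earlier sketches; collect_feedback is a simulated stand-in for a crowdsourcing call and, for simplicity, every mapping is sampled in every round (i.e. tM = 100 per cent):

import random

def collect_feedback(k, tuple_is_correct, p_truthful=0.8):
    # Simulate k replicated votes; each worker answers truthfully with probability p_truthful.
    right, wrong = ('+', '-') if tuple_is_correct else ('-', '+')
    return [right if random.random() < p_truthful else wrong for _ in range(k)]

def assimilate(extents, k=3, ts=0.2, budget=300):
    # extents: {mapping_id: {tuple_id: bool}}, the bool being the (hidden) correctness of the tuple.
    todo = {m: list(d) for m, d in extents.items()}                   # unannotated tuples
    cms = {m: {'TP': 0, 'FN': 0, 'TN': 0, 'FP': 0} for m in extents}
    estimate = None
    while budget >= k and any(todo.values()):
        for m, remaining in todo.items():
            step = max(1, int(ts * len(extents[m])))                  # tuple-sample increment
            for t in remaining[:step]:
                if budget < k:
                    break
                votes = collect_feedback(k, extents[m][t])
                mv = majority_vote(votes)
                for v in votes:                                       # update CM(m)
                    if mv == '+':
                        cms[m]['TP' if v == '+' else 'FN'] += 1
                    else:
                        cms[m]['TN' if v == '-' else 'FP'] += 1
                remaining.remove(t)
                budget -= k
        annotated = {m: cm for m, cm in cms.items() if sum(cm.values())}
        ecm = build_ecm(annotated, {m: cm_measures(cm) for m, cm in annotated.items()})
        if ecm[('H', 'e')] + ecm[('H', '~e')] and ecm[('H', 'e')] + ecm[('~H', 'e')]:
            estimate = integration_quality(ecm)                       # emit P(H | ε) for this step
    return estimate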

3. Experimental evaluation

This section reports on the results of an experimental evaluation of our main contributions on both synthetic and real-world scenarios. The main experimental goal is to assess:

  1. whether our approach is effective, i.e. whether it produces a quantification of integration quality that reflects the quality of the component mappings as decided by worker feedback on mapping results, and

  2. whether our approach is cost-effective, i.e. whether an affirmative answer to (1) is obtained with a number of feedback instances that is a low percentage of the maximum amount possible.

It is crucial to note, as the experimental results below make clear, that the technique contributed in this paper is highly dependent on the trustworthiness of the workers, in the sense that a high-quality integration that is annotated by untrustworthy workers will be computed to be low quality. As such, and as usual in real applications, worker selection prior to feedback collection is a necessity.

3.1 Experimental setup

The algorithms have been implemented in Python. Experiments were performed on an Intel Core i5 (3.20 GHz × 4) machine with 8 GB of RAM.

We use synthetic scenarios to explore the space of possibilities regarding mapping quality and worker trustworthiness. We use real-world scenarios to ascertain that the technique works well for realistic integrations. In this paper, we do not report on experiments done with real workers, which is left for future work. Instead, we model worker trustworthiness with a binomial probability model. In what follows, we describe the two kinds of scenario in some detail.

Synthetic scenarios. We synthesize an integration scenario 𝕄 by synthesizing a set of mappings, where each m ∈ 𝕄 is associated with a random number of synthetic result tuples. We induce a degree of mapping quality by generating different fractions of correct and incorrect tuples for mapping extents. The synthetic generator takes as input the number of mappings to synthesize, and, per mapping, its maximum cardinality and the fraction of correct/incorrect tuples it produces. By varying these parameters, we can generate different integration scenarios characterized by the different quality of its component mappings. In particular, we have generated three synthetic scenarios:

  1. mostly good mappings where a great many mappings, viz., ≈ 80 per cent, produce at least ≈ 80 per cent correct tuples.

  2. mostly medium-quality mappings where a great many mappings, viz., ≈ 80 per cent, produce ≈ 50 per cent correct tuples.

  3. mostly bad mappings where a great many mappings, viz., ≈ 80 per cent, produce at most ≈ 20 per cent correct tuples.
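An illustrative sketch of such a generator is given below (ours; the parameter names and the output format, which matches the extents structure assumed in the Algorithm 1 sketch above, are our own choices):

import random

def synthesize_integration(n_mappings, max_cardinality, frac_correct):
    # frac_correct(i) -> fraction of correct tuples for the i-th mapping.
    scenario = {}
    for i in range(n_mappings):
        n_tuples = random.randint(1, max_cardinality)
        scenario[f'm{i}'] = {f't{j}': random.random() < frac_correct(i)
                             for j in range(n_tuples)}
    return scenario

# A "mostly good" scenario: about 80 per cent of the mappings produce about
# 80 per cent correct tuples, the remainder about 20 per cent.
mostly_good = synthesize_integration(20, 50, lambda i: 0.8 if i < 16 else 0.2)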

We have also synthesized sets of workers, where each worker is associated with a binomial distribution B(n, p), where n is the number of trials and p is the probability of success, so as to model the tendency of a worker to provide, or not, true answers. If p = 1, then the worker always provides true answers. In particular, we have synthesized three sets of workers:

  1. mostly good workers where a great many workers, viz., ≈ 80 per cent, display high (p ≈ 0.8) trustworthiness in giving feedback (i.e. they accurately report on whether a tuple belongs or not in a mapping result).

  2. mostly medium workers where a great many workers, viz., ≈ 80 per cent, display neither high nor low (p ≈ 0.5) trustworthiness in giving feedback.

  3. mostly bad workers where a great many workers, viz., ≈ 80 per cent, display low (p ≈ 0.2) trustworthiness in giving feedback.
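The worker model can be sketched as follows (ours; the trustworthiness level assigned to the remaining ≈ 20 per cent of workers is an assumption, since the composition of the minority is not specified above):

import random

def synthesize_workers(n_workers, p_majority, p_minority=0.5, majority_share=0.8):
    # Return one truth probability p per worker: roughly 80 per cent get p_majority.
    return [p_majority if random.random() < majority_share else p_minority
            for _ in range(n_workers)]

def worker_vote(p_w, tuple_is_correct):
    # One trial of the binomial model B(n, p): a truthful answer with probability p_w.
    truthful = random.random() < p_w
    if tuple_is_correct:
        return '+' if truthful else '-'
    return '-' if truthful else '+'

mostly_good_workers = synthesize_workers(100, p_majority=0.8)   # ~80 per cent with p ≈ 0.8
mostly_bad_workers = synthesize_workers(100, p_majority=0.2)    # ~80 per cent with p ≈ 0.2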

Real-world scenarios. We have used the DSToolkit (Hedeler et al., 2012) dataspace management system to integrate two real-world data sets from the music domain, viz., DBTune and Magnatune. Their schemas are shown in Figure 2 with semantic correspondences indicated by solid lines. Table X shows one example of a good mapping (m1) and one example of a bad mapping (m2) in the resulting integration. The quality of the mappings is related to the correct identification of paired schema elements that are semantically equivalent, i.e. represent the same real-world object. For example, m1 in Table X correctly pairs DBTune.artist and Magnatune.MusicArtist, as they describe the same concept in the music domain. In contrast, m2 mistakenly pairs DBTune.record and Magnatune.Track, which describe different objects, even though they share some equivalent attributes.

The bootstrap phase of DSToolkit generated an integration comprising 29 mappings, of which (after human expert analysis) 10 were deemed of good quality (understood as producing mostly correct tuples) and 19 were deemed of bad quality. To represent real-world scenarios of diverse quality, we selected two distinct subsets of the DSToolkit-generated integration, thereby giving rise to two distinct integrations for us to experiment with, namely, I1, with mostly good mappings (specifically, ten good and three bad), and I2, with mostly bad mappings (specifically, ten bad and three good)[5].

Sampling. Given an integration 𝕄, we use a random sampling distribution to obtain from it a subset M of mappings and, from the result of evaluating a mapping m ∈ M, a subset ⟦m⟧ of the tuples on which feedback is collected. For experimental purposes, one needs to set the increment ts on the previous sample. We set ts = 10 in what follows.

Replication factor. One also needs to set the replication factor k. Recall that each tuple in the sample returns k feedback instances from k different workers, and therefore each tuple discounts the available budget by k times. We set k = 3 in what follows, therefore, if we collected feedback on five tuples, we would be spending 15 feedback instances from the budget at each step.

3.2 Experimental results

In this paper, an experiment has the following input variables:

  • the integration scenario, i.e. set of mappings; and

  • the worker trustworthiness distribution composition (mostly good, mostly medium or mostly bad), as well as the replication factor and the tuple-sample increment step.

We measure the integration quality given by equation (10) above (notated as P(H|E) in the plots that follow) as the sample size increases.

Experiment 1: Effectiveness on Synthetic Integrations. In this experiment, we use two synthetic integration scenarios. Our goal is to evaluate the effectiveness of our technique, i.e. how well the quantification corresponds to the quality of the integration for different levels of worker trustworthiness.

Firstly, we use an integration where mappings are mostly good, i.e. most of them mostly produce correct tuples, and, for three levels of trustworthiness (mostly good, mostly medium and mostly bad workers), we measure the integration quality as the sample size increases. The corresponding curves are shown in Figure 3. Then, we use an integration where mappings are mostly bad, i.e. most of them mostly produce incorrect tuples, with the corresponding results shown in Figure 4.

Figures 3 and 4 provide empirical evidence that our technique is effective, insofar as it produces a quantification of integration quality that faithfully reflects the judgment of the workers.

Thus, for an integration consisting of mostly good mappings, when workers are mostly good (i.e. the solid line in Figure 3), a sample size of 10 per cent already allows the quality to be measured accurately with respect to the true population value (i.e. at sample size of 100 per cent). Conversely, when workers are mostly bad, the consensus is the opposite of what prior expectations might suggest and the quantification reflects this reversal of expectations by plummeting (i.e. the dotted line in Figure 3). Finally, if workers are mostly of medium quality (i.e. they cannot be said conclusively to be trustworthy or not), the resulting lack of consensus in their judgments is reflected by the quantification hovering around the indifference level, irrespective of the sample size (i.e. the dashed line in Figure 3). Corresponding observations can be made with reference to Figure 4 for an integration consisting of mostly bad mappings.

Experiment 2: Effectiveness on Realistic Integrations. In this experiment, we use real-world, rather than synthetic, integrations. Our goal is still to evaluate the effectiveness of our technique but now only the workers are synthesized. The experiment has the same design as Experiment 1 above but the quality of the mappings (and hence of the resulting integration) is the result of bootstrapping with DSToolkit, i.e. we do not control the number of correct and incorrect tuples in the results. We do, as explained above, create two distinct integrations by mapping selection, viz., I1, with mostly good mappings (specifically, ten good and three bad), and I2, with mostly bad mappings (specifically, ten bad and three good).

Figures 5 and 6 confirm the empirical evidence from Figures 3 and 4 that our technique is effective insofar as the same properties can be observed regarding the faithful reflection of worker judgments.

Analogous observations can be made for Figures 5 and 6, regarding real-world scenarios, as were made for Figures 3 and 4, regarding synthetic scenarios. Again, we note that the quantification faithfully reflects the view of the workers, and is, correspondingly, sensitive to their profile regarding their being trustworthy or not.

On Cost-effectiveness. The results plotted in Figures 3 to 6 also provide evidence that our technique is cost-effective in the sense that it does not depend for its accuracy on large numbers of feedback instances nor is it, above a small value, overly sensitive to the number of feedback instances collected.

Table XI shows the integration quality (i.e. P(H|E)) per sample size when workers are mostly good for the four scenarios in Figures 3 to 6, along with the mean and standard deviation over the incremental sample sizes.

We observe that the standard deviation for all scenarios is very close to zero, indicating that, above a small value (on which, more later), the quantification is insensitive to the number of instances collected. This means that, in practice, collecting a relatively small number of feedback instances provides a reliable estimate of integration quality. Consistently with this, we note that the errors with a sample size of 10 per cent (defined as the measured quality subtracted from the true quality, measured at 100 per cent) are (0.001, 0.020, 0.011, −0.018) for the four scenarios in Table XI.
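As a quick consistency check (ours; we assume the table reports the sample standard deviation), the Synthetic/Good column of Table XI reproduces the tabulated mean, standard deviation and 10 per cent error:

from statistics import mean, stdev

good_synthetic = [0.821, 0.806, 0.833, 0.829, 0.839,
                  0.833, 0.823, 0.813, 0.821, 0.822]      # Table XI, Synthetic/Good
print(round(mean(good_synthetic), 3))                      # 0.824
print(round(stdev(good_synthetic), 3))                     # 0.01, i.e. the 0.010 of Table XI
print(round(good_synthetic[-1] - good_synthetic[0], 3))    # 0.001, the error at 10 per cent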

In addition, we compute the standard error of the mean (SEM) to quantify the standard deviation of the error with respect to the expected value (mean). Figure 7 shows the SEM (as vertical lines) at 100 per cent of the sample for mostly good real-world mappings. We observe that the standard error remains close to zero (0.01) using different worker compositions. This suggests that the integration quality estimate is consistently close to the expected value across different sample sizes. Similarly, Figure 8 shows the SEM at 100 per cent of the sample for mostly bad real-world mappings. Again, we observe that the standard error remains close to zero (0.01) using different worker compositions for an integration consisting of mostly bad real-world mappings.

To further show the cost-effectiveness of our approach, Figures 9 and 10 depict the effect on the resulting integration quality estimates using a larger replication factor (i.e. k = 5), for mostly good and mostly bad real-world mappings, respectively. This experiment was conducted using the same set of real-world mappings, and by synthesizing the set of workers as in Experiment 2. In this scenario, we can observe the following:

  • the approach is effective in the sense that the estimate of the integration quality corresponds to the characterization of the underlying integration, i.e. a high quality estimate (e.g. x > 0.8) is obtained for an integration consisting of mostly good mappings and, conversely, a low estimate (e.g. x < 0.5) is obtained for an integration consisting of mostly bad mappings; and

  • the internal majority voting function does not rely on a large number of redundant votes to produce sensible judgments on the true value of a tuple; therefore, using the minimum replication factor (i.e. k = 3) suffices for an accurate estimation of the integration quality.

In other words, a larger replication factor does not produce significantly better estimates (with respect to the true quality at 100 per cent of the sample). For completeness, we correlate the derived integration quality (P(H|E)) using two replication factors, i.e. k = 3 and k = 5, at different sample sizes. Figure 11 depicts the correlation of the resulting estimates using these factors. We observe that the derived measure using a smaller replication factor (i.e. k = 3) is slightly more pessimistic than the derived measure using a larger factor (i.e. k = 5). Comparing the two measures, we observe that there is a positive correlation between the estimates as, for most cases, a high estimate using k = 3 relates to a high estimate using k = 5. The mean absolute error (MAE) between these two estimates is 0.03, which, again, suggests that the variation in the estimate using different factors remains close to zero across different sample sizes.

To explore how few feedback instances might be needed, Figure 12 plots the integration quality estimates obtained by zooming into the (1-10 per cent) interval, in the case when workers are mostly good for scenario I1. Again, we can observe that such estimates remain close to the true quality (measured at 100 per cent). For example, the error with a sample size of 5 per cent is 0.013 (when compared to the true quality measure at 100 per cent), which seems a sensible estimation for scenarios with a limited budget and, perhaps, potentially large datasets.

4. Related work

This paper has described a systematic approach to quantifying the quality of an integration. Given the set of mappings that characterize the integration, our technique relies on judgments from workers on whether a tuple belongs or not in the extent of such a mapping. To the best of our knowledge, this is the first proposal for quantifying integration quality in a systematic and principled way. In the absence of directly comparable work, this section comments on proposals (under the pay-as-you-go paradigm for integration improvement after bootstrapping) for using feedback on individual artifacts (e.g. matching, or mapping), and proposals for uncertainty management in data integration.

Quality of Individual Artifacts. Belhajjame et al. (2011) target mapping validation and selection by means of incremental assimilation of feedback in a pay-as-you-go fashion. More specifically, end-users are explicitly asked to provide feedback by annotating returned tuples from query results generated by the mappings. In their work, feedback takes the form of true positive, false positive and false negative annotations on specific tuples returned by specific mappings. The quality of each mapping is characterized in terms of its precision and recall estimates, which are then used in both a mapping selection phase (so that, e.g. users can request that the systems answer queries with mappings that maximize precision subject to a given minimal level of recall) and a mapping refinement phase (where the estimates are used to guide a search for mappings that, taken together, constitute an integration that better meets user requirements). Our technique uses a similar form of feedback but, in contrast, derives the ground truth for a tuple by majority voting. Also, while Belhajjame et al. assign precision and recall estimates to individual mappings, we categorize (but do not quantify) the quality of individual mappings and use that to quantify the quality of the entire integration, which Belhajjame et al. do not attempt to do.

In addition, there has been significant interest in the use of crowdsourcing in data management (Crescenzi et al., 2017). In particular, Osorno-Gutierrez et al. (2013) use a crowdsourcing platform to obtain feedback on mapping results that informs mapping selection and refinement, similarly to the work of Belhajjame et al. (2011) but using real feedback instead. Again, as in Belhajjame et al. (2011), quality is only quantified for individual mappings, but, similarly to our approach, there is uncertainty management, though of a different kind. In their paper, they use a replication factor to control for inconsistent feedback both by a given worker across several tasks and, as we do, by several workers on a single task.

The experimental evaluation of Osorno-Gutierrez et al. (2013) has studied only the use of crowdsourcing to collect feedback on the correctness of mapping results. In contrast to our technique, trustworthiness of workers is ensured by introducing a degree of replication to expose agreement between workers on the extents of each mapping. In our work, evidence on mapping quality is combined with evidence on whether workers are trustworthy to build a probabilistic model for making judgments on the quality of a speculative integration.

Ríos et al. (2016) deal mainly with the problem of source selection in pay-as-you-go data integration. In particular, a heuristic-based approach is used for targeting data items to be used for feedback to support mapping and selection tasks. The approach takes as input a set of sources and feedback in the form of true positive or false positive annotations on the data items they contain. The source selection process is performed by deriving estimates of precision and recall that best meet the user's requirements. Furthermore, a cost model for efficiently sampling and collecting feedback instances to reach a desirable cut-off point is discussed. Our approach uses a similar algorithm based on the described model, in that it targets more informative feedback instances, which may reduce the amount of feedback required. However, our proposed techniques do not deal with source selection tasks.

Crowdsourcing and Uncertainty Management. Dong et al. (2009) propose the use of probabilistic mappings to handle uncertainty in data integration systems, such as dataspaces. Their concern encompasses uncertainty associated with the data itself and with the queries posed, and not only with the mappings. In their work, not only do they not quantify the quality of the integration as a whole, but they also assume that probabilities have been assigned and propose techniques to answer queries in such circumstances. In contrast, our technique can be seen as a proposal to compute the probability that an integration has good quality on the basis of evidence in the form of feedback on mapping results.

Our proposal makes the case that workers can provide feedback on the extents of a set of uncertain mappings to inform the construction of a probabilistic model used to characterize the quality of a speculative integration. On another note, uncertainty management has been intensively studied in the literature, with many approaches utilizing the new opportunities of engaging human intelligence in the process (see Doan et al., 2011, for a survey).

In Zhang et al. (2013), probabilistic matches are assumed rather than computed. A methodology is proposed to crowdsource feedback whose assimilation then updates the probabilities of matches but does not address the quality of mappings or of integrations, as this paper does.

A probabilistic reasoning approach toward the integration of linked data sources is studied by Demartini et al. (2013). Moreover, a probabilistic model is used to mitigate the uncertainty introduced by arbitrary human workers on a crowdsourcing platform. We note that our incremental model for feedback assimilation depends on the size of the mapping sample and on the size of the samples taken from their extents, which directly affects the final quality derivation for the integration.

On another note, Zhao et al. (2012) deal with the so-called truth finding problem that arises from conflicting information about the same entity being likely to exist across diverse or conflicting data sources. In contrast to our technique, which relies on the judgments of workers to quantify the quality of the integration as a whole, Zhao et al. (2012) propose a methodology for characterizing and then ranking the quality of the sources to be integrated as a solution to the truth finding problem. Instead of deriving a conditional probability as a quality measure to quantify whether the proposed integration is of a certain quality, assuming trustworthy workers, the authors leverage a Bayesian probabilistic graphical model, the latent truth model (LTM), which treats the truth as a latent random variable to learn source quality and infer truth incrementally. Their experimental evaluation supports the effectiveness of LTM as an incremental, unsupervised technique for effective truth finding and source quality estimation. As a possible future direction, we plan to explore the effectiveness of such a Bayesian probabilistic graphical model in contrast to our methodology.

To sum up, recent proposals mostly focus on using various forms of feedback, either from end users or from crowd workers, to tackle, for the most part, a specific sub-problem of data integration, e.g. source selection (Ríos et al., 2016), schema matching (Jeffery et al., 2008; Zhang et al., 2013), keyword query results (Yan et al., 2015), mapping selection and refinement (Belhajjame et al., 2013), or interactive inference of join queries (Bonifati et al., 2014), not always indicating how to explicitly and directly quantify, in a principled, cost-effective manner, the quality of the artifacts they are concerned with, as this paper does.

5. Conclusions

This paper has contributed a principled, cost-effective approach to quantifying integration quality whilst taking into account the inherent uncertainty of user feedback. The technique is sample-based and uses majority voting on feedback to yield a form of ground truth for tuple membership. From such judgments a per-mapping confusion matrix is derived, and from such confusion matrices we derive a quantification of integration quality in the form of a conditional probability. Our experimental evaluation, using synthetic and real-world integration scenarios, has produced empirical evidence for our principal claims.

Figures

Figure 1. Workflow of the approach

Figure 2. Example of Magnatune and DBTune schemas

Figure 3. Mostly good synthetic mappings

Figure 4. Mostly bad synthetic mappings

Figure 5. I1: mostly good real-world mappings

Figure 6. I2: mostly bad real-world mappings

Figure 7. Expected value (mean) at 100 per cent with confidence interval for mostly good real-world mappings

Figure 8. Expected value (mean) at 100 per cent with confidence interval for mostly bad real-world mappings

Figure 9. Mostly good real-world mappings using replication factor k = 5

Figure 10. Mostly bad real-world mappings using replication factor k = 5

Figure 11. Integration quality estimation at different sample sizes using two replication factors, k = 3 vs k = 5

Figure 12. I1: zoom in for mostly good real-world mappings

Table I. Feedback, majority votes and confusion matrix for example mapping m1 ∈ M

Tuple  Feedback (w1, w2, w3)  MV  CM counts
t1     [+, +, +]              +   TP = 3
t2     [+, −, +]              +   TP = 2, FN = 1
t3     [+, −, −]              −   TN = 2, FP = 1
Sum                               TP = 5, FN = 1, TN = 2, FP = 1

Table II. Deriving CM(m) over the sampled extent of m1

TP = 5 FP = 1
FN = 1 TN = 2

Table III. Structure of an event count matrix

ECM   ε         ε̄         Sum
H     |H ∩ ε|   |H ∩ ε̄|   |H|
H̄     |H̄ ∩ ε|   |H̄ ∩ ε̄|   |H̄|
Sum   |ε|       |ε̄|       |H| + |H̄|

Table IV. Confusion matrices for mappings m1-m5

m1 (workers w1, w2, w3)
t1: [+, +, +], MV = +, TP = 3
t2: [+, −, +], MV = +, TP = 2, FN = 1
t3: [+, −, −], MV = −, TN = 2, FP = 1
Sum for m1: TP = 5, FN = 1, TN = 2, FP = 1

m2 (workers w4, w5, w6)
t1: [−, +, −], MV = −, TN = 2, FP = 1
t2: [+, +, −], MV = +, TP = 2, FN = 1
t3: [−, −, −], MV = −, TN = 3
t4: [−, +, +], MV = +, TP = 2, FN = 1
Sum for m2: TP = 4, FN = 2, TN = 5, FP = 1

m3 (workers w7, w8, w9)
t1: [+, +, +], MV = +, TP = 3
t2: [+, +, −], MV = +, TP = 2, FN = 1
t3: [+, −, −], MV = −, TN = 2, FP = 1
t4: [−, +, +], MV = +, TP = 2, FN = 1
Sum for m3: TP = 7, FN = 2, TN = 2, FP = 1

m4 (workers w7, w8, w9)
t1: [−, −, +], MV = −, TN = 2, FP = 1
t2: [−, +, −], MV = −, TN = 2, FP = 1
t3: [+, +, −], MV = +, TP = 2, FN = 1
Sum for m4: TP = 2, FN = 1, TN = 4, FP = 2

m5 (workers w7, w8, w9)
t1: [−, −, −], MV = −, TN = 3
t2: [+, −, −], MV = −, TN = 2, FP = 1
t3: [−, +, −], MV = −, TN = 2, FP = 1
Sum for m5: TP = 0, FN = 0, TN = 7, FP = 2

Table V. CM measures for mappings m1-m5

Mapping  PPV   TPR   NPV   TNR   f+    f−
m1       0.83  0.83  0.66  0.66  0.83  0.66
m2       0.80  0.66  0.71  0.83  0.72  0.76
m3       0.87  0.77  0.50  0.66  0.82  0.56
m4       0.50  0.66  0.80  0.66  0.57  0.72
m5       0     0     1.00  0.77  0     0.87

Table VI. Propositions on quality for mappings m1-m5

Mapping TP + FN FP + TN H(m)
m1 6 3 T
m2 6 6 T
m3 9 3 T
m4 3 6 F
m5 0 9 F

Table VII. Propositions on trustworthiness for mappings m1-m5

Mapping  f+(Wm) or f−(Wm)  W?
avg({f+(Wm′) | m′ ∈ H}) = 0.78
m1       0.83              T
m2       0.72              F
m3       0.81              T
avg({f−(Wm′) | m′ ∈ H̄}) = 0.79
m4       0.72              F
m5       0.87              T

Table VIII. Event co-occurrences for mappings m1-m5

Mapping  H(m)  W?  Incremented ECM cell
m1       T     T   H ∩ ε
m2       T     F   H ∩ ε̄
m3       T     T   H ∩ ε
m4       F     F   H̄ ∩ ε̄
m5       F     T   H̄ ∩ ε

Table IX. Event count matrix for mappings m1-m5

ECM   ε   ε̄   Sum
H     2   1   3
H̄     1   1   2
Sum   3   2   5

Table X. Example of a good and a bad mapping

Id  Target table  Mapping
m1  Artist        SELECT A.name AS name, A.description AS description, A.homepage AS homepage FROM MusicArtist A
m2  Record        SELECT T.title AS title, T.track_number AS track_desc, T.created AS date_created FROM Track T

Table XI. Variability and errors in quantification

                 Scenario type
                 Synthetic      Real-world
Sample size (%)  Good    Bad    Good    Bad
10 0.821 0.160 0.789 0.240
20 0.806 0.160 0.800 0.256
30 0.833 0.170 0.833 0.230
40 0.829 0.170 0.883 0.222
50 0.839 0.180 0.778 0.242
60 0.833 0.180 0.796 0.232
70 0.823 0.170 0.800 0.211
80 0.813 0.180 0.805 0.222
90 0.821 0.170 0.833 0.202
100 0.822 0.180 0.800 0.222
Mean 0.824 0.170 0.806 0.228
SD 0.010 0.008 0.019 0.016

Notes

1.

This notion of what counts as good here, and hence of what counts as integration quality, is made precise below.

2.

Recall, here and elsewhere in the paper, actual correctness is decided by the majority vote.

3.

For concision, we abuse notation here insofar as we write TP when we should write TP(Wm), and, correspondingly for FP, TN and FN.

4.

Note that, in column labels, W? should be read as "Are the workers trustworthy?".

5.

Note that this step is consistent with the need, in dataspace management systems, to judiciously select subsets of the mappings that comprise the integration returned by the bootstrapping phase for use in answering queries (Belhajjame et al., 2013).

References

Aumueller, D., Do, H.H., Massmann, S. and Rahm, E. (2005), "Schema and ontology matching with COMA++", Proceedings of SIGMOD 2005, pp. 906-908.

Belhajjame, K., Paton, N.W., Embury, S.M., Fernandes, A.A.A. and Hedeler, C. (2013), "Incrementally improving dataspaces based on user feedback", Information Systems, Vol. 38 No. 5, pp. 656-687.

Belhajjame, K., Paton, N.W., Fernandes, A.A.A., Hedeler, C. and Embury, S.M. (2011), "User feedback as a first class citizen in information integration systems", Proceedings of CIDR 2011, pp. 175-183.

Bonifati, A., Ciucanu, R. and Staworko, S. (2014), "Interactive inference of join queries", Proceedings of EDBT 2014, pp. 451-462.

Crescenzi, V., Fernandes, A.A.A., Merialdo, P. and Paton, N.W. (2017), "Crowd-sourcing for data management", Knowledge and Information Systems, Vol. 53 No. 1.

Demartini, G., Difallah, D.E. and Cudré-Mauroux, P. (2013), "Large-scale linked data integration using probabilistic reasoning and crowdsourcing", VLDB Journal, Vol. 22 No. 5, pp. 665-687.

Doan, A., Halevy, A. and Ives, Z. (2012), Principles of Data Integration, 1st ed., Morgan Kaufmann Publishers, San Francisco, CA.

Doan, A., Ramakrishnan, R. and Halevy, A.Y. (2011), "Crowdsourcing systems on the world-wide web", Communications of the ACM, Vol. 54 No. 4, pp. 86-96.

Dong, X.L., Halevy, A.Y. and Yu, C. (2009), "Data integration with uncertainty", VLDB Journal, Vol. 18 No. 2, pp. 469-500.

Fawcett, T. (2006), "An introduction to ROC analysis", Pattern Recognition Letters, Vol. 27 No. 8, pp. 861-874.

Furche, T., Gottlob, G., Grasso, G., Guo, X., Orsi, G., Schallhart, C. and Wang, C. (2014), "DIADEM: thousands of websites to a single database", Proceedings of the VLDB Endowment, Vol. 7 No. 14, pp. 1845-1856.

Halevy, A.Y., Franklin, M.J. and Maier, D. (2006), "Principles of dataspace systems", Proceedings of PODS 2006, pp. 1-9.

Heath, T. and Bizer, C. (2011), Linked Data: Evolving the Web into a Global Data Space, Synthesis Lectures on the Semantic Web, Morgan and Claypool Publishers, Williston, VT.

Hedeler, C., Belhajjame, K., Fernandes, A.A.A., Embury, S.M. and Paton, N.W. (2009), "Dimensions of dataspaces", Proceedings of BNCOD 26, pp. 55-66.

Hedeler, C., Belhajjame, K., Mao, L., Guo, C., Arundale, I., Lóscio, B.F., Paton, N.W., Fernandes, A.A.A. and Embury, S.M. (2012), "DSToolkit: an architecture for flexible dataspace management", Transactions on Large-Scale Data- and Knowledge-Centered Systems, Vol. 5, pp. 126-157.

Jeffery, S.R., Franklin, M.J. and Halevy, A.Y. (2008), "Pay-as-you-go user feedback for dataspace systems", Proceedings of SIGMOD 2008, pp. 847-860.

Madhavan, J., Cohen, S., Dong, X.L., Halevy, A.Y., Jeffery, S.R., Ko, D. and Yu, C. (2007), "Web-scale data integration: you can afford to pay as you go", Proceedings of CIDR 2007, pp. 342-350.

Marnette, B., Mecca, G., Papotti, P., Raunich, S. and Santoro, D. (2011), "++Spicy: an open-source tool for second-generation schema mapping and data exchange", Proceedings of the VLDB Endowment, Vol. 4 No. 12, pp. 1438-1441.

Osorno-Gutierrez, F., Paton, N.W. and Fernandes, A.A.A. (2013), "Crowd-sourcing feedback for pay-as-you-go data integration", Proceedings of DBCrowd 2013, pp. 32-37.

Paton, N.W., Belhajjame, K., Embury, S.M., Fernandes, A.A.A. and Maskat, R. (2016), "Pay-as-you-go data integration: experiences and recurring themes", Proceedings of SOFSEM 2016, pp. 81-92.

Paton, N.W., Christodoulou, K., Fernandes, A.A.A., Parsia, B. and Hedeler, C. (2012), "Pay-as-you-go data integration for linked data: opportunities, challenges and architectures", Proceedings of SWIM 2012, p. 3.

Ríos, J.C.C., Paton, N.W., Fernandes, A.A.A. and Belhajjame, K. (2016), "Efficient feedback collection for pay-as-you-go source selection", Proceedings of SSDBM 2016, pp. 1:1-1:12.

Sarma, A.D., Dong, X. and Halevy, A.Y. (2008), "Bootstrapping pay-as-you-go data integration systems", Proceedings of SIGMOD 2008, pp. 861-874.

Yan, Z., Zheng, N., Ives, Z.G., Talukdar, P.P. and Yu, C. (2015), "Active learning in keyword search-based data integration", VLDB Journal, Vol. 24 No. 5, pp. 611-631.

Zhang, C.J., Chen, L., Jagadish, H.V. and Cao, C.C. (2013), "Reducing uncertainty of schema matching via crowdsourcing", Proceedings of the VLDB Endowment, Vol. 6 No. 9, pp. 757-768.

Zhao, B., Rubinstein, B.I.P., Gemmell, J. and Han, J. (2012), "A Bayesian approach to discovering truth from conflicting sources for data integration", Proceedings of the VLDB Endowment, Vol. 5 No. 6, pp. 550-561.

Further reading

Sarma, A.D., Dong, X.L. and Halevy, A.Y. (2011), "Uncertainty in data integration and dataspace support platforms", Schema Matching and Mapping, pp. 75-108.

Serrano, F.R.S., Fernandes, A.A.A. and Christodoulou, K. (2017), "Quantifying integration quality using feedback on mapping results", Proceedings of the 19th International Conference on Information Integration and Web-based Applications and Services, iiWAS 2017, Salzburg, Austria, 4-6 December, pp. 3-12.

Corresponding author

Klitos Christodoulou can be contacted at: klitos@nup.ac.cy