Data integration with uncertainty

This paper reports our first set of results on
managing uncertainty in data integration. We posit that data integration
systems need to handle uncertainty at three levels
and do so in a principled fashion. First, the semantic mappings
between the data sources and the mediated schema
may be approximate because there may be too many of them
to be created and maintained or because in some domains
(e.g., bioinformatics) it is not clear what the mappings should
be. Second, the data from the sources may be extracted using
information extraction techniques and so may yield erroneous
data. Third, queries to the system may be posed with
keywords rather than in a structured form. As a first step to
building such a system, we introduce the concept of probabilistic
schema mappings and analyze their formal foundations.
We show that there are two possible semantics for such mappings:
by-table semantics assumes that there exists a correct
mapping but we do not know what it is; by-tuple semantics
assumes that the correct mapping may depend on the particular
tuple in the source data. We present the query complexity
and algorithms for answering queries in the presence of probabilistic
schema mappings, and we describe an algorithm for
efficiently computing the top-k answers to queries in such a
setting. Finally, we consider using probabilistic mappings in
the scenario of data exchange.
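
To make the difference between the two semantics concrete, here is a minimal sketch in Python. It is not the paper's algorithm. The source table, the attribute names, the mapping probabilities, and the helper functions by_table and by_tuple are all hypothetical. It also assumes that the probability of an answer is the total probability of the mappings (by-table) or mapping sequences (by-tuple) that produce it, and that under by-tuple semantics each source tuple picks its mapping independently. The by-tuple version enumerates every mapping sequence, so it is exponential in the number of tuples and only suitable for toy inputs.

```python
from collections import defaultdict
from itertools import product

# Hypothetical source table with two candidate address columns.
SOURCE = [
    {"pname": "Alice", "current_addr": "Mountain View", "permanent_addr": "Sunnyvale"},
    {"pname": "Bob", "current_addr": "Sunnyvale", "permanent_addr": "San Jose"},
]

# Hypothetical probabilistic mapping: each candidate says which source
# column the mediated attribute "mailing_addr" corresponds to, together
# with the probability that this candidate is the correct one.
P_MAPPING = [
    ({"mailing_addr": "current_addr"}, 0.6),
    ({"mailing_addr": "permanent_addr"}, 0.4),
]


def by_table(source, p_mapping, attr):
    """One mapping is correct for the whole table, but we do not know which."""
    probs = defaultdict(float)
    for mapping, p in p_mapping:
        # Apply the same mapping to every source tuple.
        answers = {row[mapping[attr]] for row in source}
        for answer in answers:
            probs[answer] += p
    return dict(probs)


def by_tuple(source, p_mapping, attr):
    """The correct mapping may differ from tuple to tuple (assumed independent)."""
    probs = defaultdict(float)
    # Enumerate every way of assigning one candidate mapping to each tuple.
    for sequence in product(p_mapping, repeat=len(source)):
        weight = 1.0
        answers = set()
        for row, (mapping, p) in zip(source, sequence):
            weight *= p
            answers.add(row[mapping[attr]])
        for answer in answers:
            probs[answer] += weight
    return dict(probs)


if __name__ == "__main__":
    print("by-table:", by_table(SOURCE, P_MAPPING, "mailing_addr"))
    print("by-tuple:", by_tuple(SOURCE, P_MAPPING, "mailing_addr"))
```

On this toy input the two semantics agree on "Mountain View" and "San Jose" but disagree on "Sunnyvale". Under by-table semantics every candidate mapping yields it for some tuple, while under by-tuple semantics the sequence that maps Alice by current_addr and Bob by permanent_addr does not produce it at all.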


