28 Dec 2006
Integrating XML data sources using approximate joins

Speaker: Jim GONG
Abstract
XML is widely recognized as the data interchange standard of tomorrow because of
its ability to represent data from a variety of sources. Hence, XML is likely to
be the format through which data from multiple sources is integrated. In this
article, we study the problem of integrating XML data sources through
correlations realized as join operations. A challenging aspect of this operation
is the XML document structure. Two documents might convey approximately or
exactly the same information but may be quite different in structure.
Consequently, an approximate match in structure, in addition to content, has to
be folded into the join operation. We quantify an approximate match in structure
and content for pairs of XML documents using well-defined notions of distance.
We show how notions of distance that have metric properties can be incorporated
in a framework for joins between XML data sources and introduce the idea of
reference sets to facilitate this operation. Intuitively, a reference set
consists of data elements used to project the data space. We characterize what
constitutes a good choice of a reference set, and we propose sampling-based
algorithms to identify such sets. We then instantiate our join framework using the
tree edit distance between a pair of trees. We next turn our attention to
utilizing well-known index structures to improve the performance of approximate
XML join operations. We present a methodology enabling adaptation of index
structures for this problem, and we instantiate it in terms of the R-tree. We
demonstrate the practical utility of our solutions using large collections of
real and synthetic XML data sets, varying parameters of interest, and
highlighting the performance benefits of our approach.
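
To make the reference-set idea concrete, here is a minimal sketch of how a metric distance can prune candidate pairs in an approximate join. It is not the algorithm from the talk. To keep it short, plain strings and the Levenshtein distance stand in for XML trees and the tree edit distance, the reference set is picked by hand rather than by sampling, and the names edit_distance, project, and approximate_join are hypothetical. The only property it relies on is the triangle inequality: for any reference element r, |d(x, r) - d(y, r)| <= d(x, y), so a pair whose projected distances differ by more than the threshold cannot be a match.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance. A stand-in metric (assumption): the talk uses
    the tree edit distance between XML trees instead."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def project(items, refs, dist):
    """Map each item to its vector of distances to the reference set."""
    return [[dist(x, r) for r in refs] for x in items]


def approximate_join(left, right, refs, tau, dist=edit_distance):
    """Return all pairs (x, y) with dist(x, y) <= tau, plus pruning counts.

    The lower bound max_r |d(x, r) - d(y, r)| follows from the triangle
    inequality, so pruning never discards a true match.
    """
    left_vecs = project(left, refs, dist)
    right_vecs = project(right, refs, dist)
    result = []
    stats = {"pruned": 0, "verified": 0}
    for (x, vx), (y, vy) in product(zip(left, left_vecs),
                                    zip(right, right_vecs)):
        lower = max((abs(a - b) for a, b in zip(vx, vy)), default=0)
        if lower > tau:
            stats["pruned"] += 1    # cannot match, skip the exact distance
            continue
        stats["verified"] += 1      # survivor: compute the exact distance
        if dist(x, y) <= tau:
            result.append((x, y))
    return result, stats


if __name__ == "__main__":
    left = ["abc", "abd", "xyzxyz"]
    right = ["abc", "xyzxyw"]
    pairs, stats = approximate_join(left, right, refs=["abc"], tau=1)
    print(pairs)
    print(stats)
```

The nested loop still visits every pair; the saving in this sketch is only that pruned pairs skip the exact distance computation. Because each element becomes a point in a space with one dimension per reference element, the projected vectors could instead be stored in a multidimensional index such as the R-tree mentioned in the abstract, so that candidates are retrieved by a range query rather than a scan. That step is not shown here.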