Current Research


  A. BACKGROUND

There has been extensive work related to top-k queries and skyline queries in database reasarch. The goal of my current project is the extensive study of top-k dominating queries, which is a hybrid of these two queries. This problem has been introduced in SigMod/Pod2005, however, only a skyline-based solution was proposed, which does not consider the special nature of the problem. To our knowledge there is no other work addressing such queries. We will base on existing generic algorithms to propose a efficient skyline algorithm for high dimensional and large data set.

The PI has good knowledge and working experience on problems related to top-k query processing and multi-dimensional data analysis. It has developed algorithms and optimization techniques for top-k and nearest neighbor queries in high-dimensional spaces. In addition, It has worked on view selection and top-k group-by queries for OLAP data. Finally, It has been involved in projects related to query processing over data streams.

  B. Top-k Queries

A top-k query combines different rankings (e.g., based on price, quality, distance) of the same set of objects (e.g., hotels) and returns the k objects with the highest combined score according to an aggregate function (e.g., average). Exemplary applications for this query type include the retrieval of images according to their aggregate similarity to an example image with respect to various features, like color, texture, shape, etc. and e-commerce services sorting their products according to user preferences to facilitate purchase decision. R. Fagin et al.(PODS2001) presented a comprehensive analytical study of various methods for top-k aggregation of ranked inputs by monotone aggregate functions, basen on whether sorted and/or random accesses are allowed to the inputs to be aggregated. The threshold algorithm (TA) applies random accesses for each retrieved object from a sorted list, in order to complete its aggregate score and terminates as soon as the top-k discovered objects so far are guaranteed to be in the result. The no-random-accesses algorithm (NRA) accesses the lists only sequentially, while maintaining lower and upper score bounds for the objects seen so far, until the top-k worst-case scores are higher than the best-case scores of the remaining objects. Top-k queries have also been studied for centralized, relational databases. N. Bruno et al.(TDS2002) process top-k queries, after converting them to multidimensional range queries. They utilize multidimensional histograms to accelerate search. The top-k query is in fact a special case of a top-k join query, where the results of a join are to be output in order of an aggregate score on their various components. In the basic top-k query case, the ordered hobject-id,atomic scorei tuples are joined by object-id and then ranked based on an aggregate function on the scores. PREFER (V. Hristidis et al., VLDBJ2004) is a system for maintaining materialized rankings of the database objects based on some preference function. To evaluate a top-k query, the system selects the materialized view corresponding to the function that is most similar to and examines only a subset of the records in this view. G. Das et al, (VLDB'06) addresses the problem of using a set top-k materalized views in order to answer another top-k query, for views and queries with linear aggregate functions. Another approach issued by Y.-C. Chang et al.,(SIGMOD'00), exploits the geometric properties of linear preference functions; the object with the highest score for any linear preference function is on the convex hull of the multidimensional space formed by the atomic scores of the objects at all attributes. Thus, by materializing and maintaining convex hull layers we can evaluate top-k queries efficiently. An extension of this idea was proposed in VLDB'06 by D. Xin, Jiawei Han et al.. Another method for computing and maintaining top-k query results with the use of materialized views is followed by K. Yi et al.(ICDE2003); for each view, a larger number than k of ranked results is maintained, which requires updating less often. A recent work on the maintenance of top-k query results for streaming data constrained by sliding windows is issued by K. Mouratidis et al. in SIGMOD2006. Finally, an index-based approach for computing top-k query results for centralized data was presented in ICDE2003 by P. Tsaparas et al..

An interesting variant of the query, which we plan to study is the bichromatic top-k dominating query, which retrieves the k objects in a producer dataset DP that dominate the largest number of objects in a consumer DA dataset.

  C. Skyline Queries

Skyline Queries Given a multi-dimensional dataset (where dimensions correspond to atomic rankings or preferences), an object p is part of its skyline, if there exists no other object that ˇ°dominatesˇ± p in all dimensions. Bˇ§orzsˇ§onyi et al. were the first to propose efficient external memory algorithms for finding the skyline of a dataset. The BNL (block-nested-loops) algorithm scans the dataset while employing a bounded buffer for tracking the points that cannot be dominated by other points in the buffer. A point is reported as a result if it cannot be dominated by any other point in the dataset. On the other hand, the DC (divide-and-conquer) algorithm recursively partitions the dataset until each partition is small enough to fit in memory. After the local skylines in each partition are computed, they are merged to form the global skyline. The BNL algorithm was later improved to SFS (sort-filter-skyline) and LESS (linear elimination sort for skyline) in order to optimize the average-case running time.

The above algorithms are generic and applicable for non-indexed data. On the other hand, K.-L. Tan et al VLDB2001,D. Papadias et al.(SIGMOD2001) and D. Kossmann et al.(VLDB02) and so on exploited data indexes to accelerate skyline computation. The state-of-the-art algorithm is the branch-and-bound skyline algorithm [D. Kossmann et al.(VLDB02)], which is shown to be I/O optimal for computing skylines on datasets indexed by R-trees.

Recently, the research focus has been shifted to the study of queries based on the dominance relationship. C. Li et al.(SIGMOD2001) proposed a data cube structure for speeding up the evaluation of queries that analyze the dominance relationship of points in the dataset. However, incremental maintenance of the data cube over updates has not been addressed. C.-Y. Chan,et al. (EDBT2006) identified the problem of computing top-k frequent skyline points, where the frequency of a point is defined by the number of dimensional subspaces, still studied the k-dominant skyline query in SIGMOD2006, which is based on the k-dominance relationship. A point p is said to k-dominate another point p if q dominates p in at least one k-dimensional subspace. The k-dominant skyline contains the points that cannot be k-dominated by any other point. When k decreases, the size of the k-dominant skyline also decreases.

Finally, Y. Yuan et al.(VLDB2006) study the efficient computation of skylines for every subspace; Y. Tao et al.(ICDE2006) propose a technique for retrieving the skyline for a given subspace; Z. Huang et al. (ICDE2006) investigate skyline computation over distributed data; Y. Tao et al(TKDE2006) study continuous maintenance of the skyline over a data stream; and C.-Y. Chan et al., (SIGMOD2005) address skyline computation over datasets with partially-ordered attributes.

  D. Research Issues and Methodologies

Motivated by above observations, we will investigate algorithms that are especially designed for top-k dominating queries and Skyline operator, or operate on aggregate R¨Ctrees. We will outline some potential methods for processing top-k dominating queries vs Skyline Queries,