Reynold Cheng's Research Projects

Home | Publications | DBLP Bibliography | NEC CiteSeer| Google


Summary: Data Management in Dynamic Environments

Due to advances in networking technologies and widespread use of inexpensive sensors and location-sensing devices, it is now possible to develop systems that enable users to closely interact with the external world. In particular, retrieving data from external environments in a correct, efficient and scalable manner becomes essential. This is a hard problem, because stream data is often generated in a continuous manner with high frequencies, and can easily exhaust computational and communication resources (e.g., network bandwidth and sensor battery power). Moreover, the data obtained, such as location and temperature readings, are often inherently uncertain due to their continuously changing characteristics and the limited precision of data-acquisition hardware. Novel techniques such as uncertainty reasoning, uncertain data indexing and communication protocols need to be developed.

Please click the following links to see my work in this area up to date.



Querying Uncertain Data

Sensor data is often obtained by sampling. Due to their continuous-changing nature, sensor values are inherently inaccurate. Ignoring this aspect can lead to incorrect answers being computed. We studied the issues of managing uncertain data in a constantly-evolving environment. We proposed data models to capture temporal and spatial uncertainty, where a data item is represented as an interval with a probability distribution function instead of a single value. These models can be applied to a wide range of sensor-based applications, and can
be extended to multiple dimensions. With the notion of data uncertainty, query answers are correct and imprecise. To express this imprecision, we introduced probabilistic queries, for which answers are augmented with probability values to indicate their confidence of satisfying the queries. We further devised a classification scheme of probabilistic queries, under which query evaluation algorithms were developed.

An interesting aspect of probabilistic queries is that their answers can be vague, and the vagueness is an indication of the quality of a probabilistic answer. We proposed metrics for quantifying probabilistic answer quality of different query classes. The idea of query-centric update policies was then investigated, where sensors are selected for updating values based on their effect on the quality of query answers. Extensive simulations showed that these policies save network bandwidth.


Join is an important problem in a database. When data values are uncertain, the notions of join operators need to be changed accordingly. In a more recent work (under review), we have extended various comparion operators to support uncertain values. We also investigate how join queries can be evaluated over uncertain data attributes.

Recently we have been working on the prototype of a database system, called U-DBMS, for managing constantly-evolving uncertain data. U-DBMS is an enhancement of an open-source object-oriented database system (PostgreSQL). It provides SQL-like query langauge for easy specification of uncertain data as well as evaluation of probabilistic queries. The prototype was demonstrated in [VLDB 2005].  Its detailed project description can be found here.


Related Papers

  1. Reynold Cheng, Sarvjeet Singh and Sunil Prabhakar. U-DBMS: A Database System for Managing Constantly-Evolving Data. In Very Large Databases Conf. (VLDB 2005 Poster), Trondheim, Norway, Aug 2005. [Paper]
  2. Reynold Cheng and Sunil Prabhakar.  Sensors, Uncertainty Models and Probabilistic Queries. Accepted for publication in Encyclopedia of Database Technologies and Applications, edited by L. Rivero, J. Doorn and E. Ferraggine, Idea Group Publishing, 2005. [Chapter]
  3. Reynold Cheng  and Sunil Prabhakar.  Managing Uncertainty in Sensor Databases.   Special Section on Sensor Network Technology and Sensor Data Management, SIGMOD Record, Vol.32, No.4, pp. 41-46, Dec 2003. [Paper]
  4. Reynold Cheng,  Dmitri V. Kalashnikov and Sunil Prabhakar. Evaluating Probabilistic Queries over Imprecise Data.  In Proc. of the ACM Special Interest Group on Management of Data (ACM SIGMOD 2003), pp. 551-562, June 2003. Acceptance rate: 15.2%, 52/342. [Paper][Talk]
  5. Reynold Cheng, D. V. Kalashnikov, S. Prabhakar, Evaluating Probabilistic Queries over Imprecise Data, Technical Report number 02-026, Purdue University, November 2002. [Report]
  6. Reynold Cheng, Dmitri V. Kalashnikov and Sunil Prabhakar. Evaluation of Probabilistic Queries over Imprecise Data in Constantly-Evolving Environments. In Information Systems (IS). [Paper]
  7. Reynold Cheng.  Managing Uncertainty in Constantly-Evolving Environments, PhD Thesis, Purdue University, May 2005.  [Thesis][Defence]
  8. Reynold Cheng, S. Singh, Sunil Prabhakar, R. Shah, J. Vitter and Y. Xia. Efficient Join Processing over Uncertain Data. In the ACM 15th Conf. on Information and Knowledge Management (ACM CIKM 2006), Arlington, USA 2006. Acceptance rate: 15%. [Paper]
[Back to top]



Querying Imprecise Moving Objects 

A typical example of data streaming applications with inherent uncertainty is a location-based service, which monitors continuously-moving objects and provides services based on their locations. We applied our data uncertainty models to such environments, and developed probabilistic query evaluation techniques for nearest-neighbor queries. These query algorithms take into account the specific behaviors of moving objects, e.g., whether they move on a well-defined path. Empirical studies showed that they can handle answer incorrectness better than traditional queries.


Related Papers
  1. Reynold Cheng, Dmitri V. Kalashnikov and Sunil Prabhakar.  Querying Imprecise Data in Moving Object Environments. In IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), Vol. 16, No. 9, pp. 1112-1127, Sep 2004.[Paper][Appendix]
  2. Reynold Cheng, Sunil Prabhakar and Dmitri V.   Kalashnikov. Querying Imprecise Data in Moving Object Environments.   in Proc. of the Intl. Conf. on Data Engineering (IEEE ICDE 2003), pp. 723-725, Bangalore, India, Mar 2003. Acceptance rate: 13.5%. [Paper][Talk]
  3. Reynold Cheng, Sunil Prabhakar and D. Kalashnikov. Querying Imprecise Data in Moving Object Environments, Technical Report  number 02-020, Purdue University, October 2002. [Report]
[Back to top]



Indexing Uncertain Data


Probabilistic queries are guaranteed to be correct and more informative than traditional queries. However, they are more expensive to evaluate. For many applications, although answer probabilities are useful, it is often only necessary to know whether the probability exceeds a given threshold. We term this query Probabilistic Threshold Query (PTQ). In VLDB 2004, efficient solutions for computation of PTQ were studied. We established that this problem is theoretically difficult to be solved efficiently. However, there exist practical solutions. In particular, we proposed two index structures and associated algorithms. The first index scheme is based on the idea of augmenting R-trees with uncertainty information. To overcome limitations of the first indexing scheme, a technique called variance-based clustering was developed, where data points with similar degrees of uncertainty are clustered together. This extensive index structure can answer queries for various kinds of uncertainty probability density functions with a close-to-optimal performance. Experiments showed that both indexing schemes outperform interval R-trees.  In [VLDB 2005], we have extended these indexes to support multi-dimensional uncertain data.  The studies over joins on uncertain data are published in CIKM 2006.


Related Papers
  1. Yufei Tao, Reynold Cheng, Xiaokui Xiao, WangKay Ngai, Ben Kao and Sunil Prabhakar. Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions. Accepted in Very Large Databases Conf. (VLDB 2005), Trondheim, Norway, Aug 2005. Acceptance rate: 16.5%, 53/322. [Paper][Project]
  2. Reynold Cheng, Yuni Xia, Sunil Prabhakar, Rahul Shah and Jeffrey Scott Vitter.  Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data. In Very Large Databases Conference (VLDB 2004), pp. 876-887, Toronto, Canada, Sep 2004. Acceptance rate: 16%. [Paper][Talk]
  3. Reynold Cheng, S. Singh, Sunil Prabhakar, R. Shah, J. Vitter and Y. Xia. Efficient Join Processing over Uncertain Data. In the ACM 15th Conf. on Information and Knowledge Management (ACM CIKM 2006), Arlington, USA 2006. Accepted rate: 15%. [Paper]

[Back to top]



Indexing Moving Objects 

Consider a system when millions of moving objects are being queried. Due to the high update rate in this environment, traditional indexing methods perform poorly. We studied the problem of efficient indexing for this kind of environment, where we exploited the properties of moving objects and designed better indexes for efficient processing of updates and queries. In
[ACM SAC 2005], statistical information such as mean and variance was proposed as clustering criteria. In [IEEE ICDE 2005], we observed that very often objects stay in the same region (called quasi-static region) for an extensive amount of time before leaving it for another one. Based on historical information, an algorithm for discovering these regions was devised. A moving object index was constructed based on quasi-static regions, which allows better performance compared with indexes that are built using only the current positions of moving objects.


Related Papers
  1. Y. Xia, R. Cheng, S. Prabhakar, S. Lei and R. Shah.  Indexing Continuously Changing Data with Mean-Variance Tree.  In the Intl. Journal of High Performance Computing and Networking (IJHPCN): A Special Issue on Recent Advances in Collaborative Internet Computing, Inderscience Publishers.
  2. Yuni Xia, Sunil Prabhakar, Shan Lei, Reynold Cheng and Rahul Shah. Indexing Continuously Changing Data with Mean-Variance Tree. Accepted for publication in the 20th Annual ACM Symposium on Applied Computing (ACM SAC 2005), Mar 2005. Acceptance rate: 30% [Paper]
  3. Reynold Cheng, Yuni Xia, Sunil Prabhakar and Rahul Shah.  Change Tolerant Indexing over Constantly Evolving Data. Accepted for publication in Intl. Conf. on Data Engineering (IEEE ICDE 2005), Tokyo, Japan, Apr 2005. Acceptance rate: 12.9%, 67/521. [Paper]
  4. Reynold Cheng, Yuni Xia, Sunil Prabhakar and Rahul Shah.  Change Tolerant Indexing over Constantly Evolving Data, Technical Report number 04-006, Purdue University, June 2004. [Report]

[Back to top]



Location Privacy

In location-based services, a user's location has to be known before she can be served. However, her location privacy can be lost if she does not want anyone to know where she is. We wrote a survey to investigate the various aspects of location privacy, and the current solutions, including legislation and technologies like privacy access control policies and location anonymizers.

We also proposed the use of probabilistic queries for controlling the trade-off between location privacy and quality of service [Mobile HCI 2004]. In this solution, location privacy is protected by providing the service provider with location information in a lower spatial and temporal resolution. In doing so, however, service quality can be degraded. We model the "blurred" location information as uncertain data, and by using probabilistic querying techniques a quantification of service quality can be obtained. In a more recent work, we are using probabilistic queries to study the balance between privacy and quality of service. A related problem is the prevention of "inference attack" from such systems, based on the knowledge of history of the object motion [PET 2006].


Related Papers
  1. Reynold Cheng, Yu Zhang, Elisa Bertino and Sunil Prabhakar.  Preserving User Location Privacy in Mobile Data Management Infrastructures. In the Lecture Notes in Computer Science (LNCS), Privacy Enhancing Technology Workshop (PET 2006), Cambridge, UK, June 2006.  Acceptance rate: 26%. [Paper][Talk]
  2. Reynold Cheng and Sunil Prabhakar.  Using Uncertainty to Provide Privacy-Preserving and High-Quality Location-Based Service. In the Mobile HCI 2004 workshop on Location Systems Privacy and Control, Glasgow, Scotland, September 2004. [Paper][Talk]
  3. Reynold Cheng. A Survey on Location Privacy. CS 626 Information Assurance class project, May 2003. [Report]

[Back to top]



Communication in Streams and Sensor Networks

Probabilistic queries can be used to select appropriate sensors in a sensor network that is composed of many low-cost but error-prone sensors. In [RTCSA 2005, ACM VSSN 2004] We studied the problem where it is assumed that many sensors for monitoring the same entity (e.g., temperature of a room) are available in a bandwidth-limited wireless network. With sufficient number of sensors, the aggregate sensor reading obtained can have a better confidence guarantee. However, due to limited network bandwidth, one can only afford to choose a minimum number of sensors. Probabilistic queries can help in this context, in which probabilistic guarantees of answers are used to quantify the minimum number of sensors to be used to attain a reliable reading.  We studied the sensor selection problem for a number of queries, including range queries, range count queries and maximum/minimum queries.

There is a trade-off between frequencies of updates from streams and accuracy of query answers. In
[VLDB 2005], we investigated efficient mechanisms for reducing communication cost, while guaranteeing that the results are correct within a certain error threshold. The queries that we studied are continuous "entity-based queries", which return the names of objects as part of the answers. We used both real and synthetic datasets to verify that my proposed protocols can reduce the utilization of network bandwidth.


Related Papers
  1. S. Han, E. Chan, R. Cheng and K. Y. Lam. A Statistics-Based Sensor Selection Scheme for Continuous Probabilistic Queries in Sensor Network. In the Real Time Systems Journal (RTS).
  2. Reynold Cheng, Ben Kao, Sunil Prabhakar, Alan Kwan and Yicheng Tu. Adaptive Stream Filters for Entity-based Queries with Non-Value Tolerance. Accepted in Very Large Databases Conf. (VLDB 2005), Trondheim, Norway, Aug 2005. Acceptance rate: 16.5%, 53/322.  [Paper][Project] 
  3. Song Han, Edward Chan, Reynold Cheng and Kam-Yiu Lam. A Statistics-Based Sensor Selection Scheme for Continuous Probabilistic Queries in Sensor Networks. Accepted in the 11th IEEE Intl. Conf. on Embedded and Real-Time Computing Systems and Applications (IEEE RTCSA 2005), Hong Kong, Aug 2005. [Paper][Project]
  4. Kam-Yiu Lam, Reynold Cheng, BiYu Liang and Jo Chau.  Sensor Node Selection for Execution of Continuous Probabilistic Queries in Wireless Sensor Networks. In ACM 2nd International Workshop on Video Surveillance and Sensor Networks (ACM VSSN 2004) in conjunction with 12th ACM International Conference on Multimedia, pp. 63-71, New York, Oct 2004. [Paper][Talk]

[Back to top]



Real-time Databases

During my years at the University of Hong Kong, I studied a concurrency control protocol for a real-time database system, where sensors report their updates to the system continuously. While the system has to allocate resources to refresh data upon new updates, it also has to take care of the timing constraints on transactions. We studied a set of concurrency control protocols in
[ACM Postgraduate Research Day 1999; ACM CIKM 1999; IEEE TC 2003] which not only guarantee transaction correctness, but also consider the trade-off between data freshness and timing requirements. We also performed an extensive simulation to compare various protocols. My MPhil. thesis in HKU was basd on these works. In [IS 2002], We developed concurrency control protocols for a database system that support transactions with different real-time requirements. We also wrote a book chapter about the current state of art of disk management in real-time database systems.


Related Papers
  1. Ben Kao, Kam-Yiu Lam, Brad Adelberg, Reynold Cheng and Tony Lee. Maintaining Temporal Consistency of Discrete Objects in Soft Real-Time Database Systems, in IEEE Transactions on Computers (IEEE TC), Vol. 52, No. 3, pp. 373-389, March 2003. [Paper]
  2. Kam-Yiu Lam, Tei-Wei Kuo, Ben  Kao, Tony S.H. Lee and Reynold Cheng, Evaluation of Concurrency Control Strategies for Mixed Soft Real-Time Database Systems, in Information Systems (IS), Vol. 27, No. 2, pp. 123-149, Elsevier Science, 2002. [Paper]
  3. Ben Kao and Reynold Cheng. Disk I/O Scheduling. In Real-Time Database Systems: Architecture and Issues, edited by Kam-Yiu Lam and Tei-Wei Kuo, Kluwer Academic Publishers, pp. 97-107, Boston,  December 2001. [Chapter]
  4. Reynold Cheng, Updates and View Maintenance in Soft Real-Time Database Systems. In the 2nd ACM Hong Kong Postgraduate Research Day (2nd ACM HK PG Day, Best Paper Award), University of Hong Kong, October 23, 1999. [Paper]
  5. Ben Kao, K.Y. Lam, Brad Adelberg, Reynold Cheng and Tony Lee.Updates and View Maintenance in Soft Real-Time Database Systems, in the Eighth International Conference on Information and Knowledge Management (ACM CIKM '99), pp. 300-307, Kansas City, Missouri, USA, Nov., 1999. Acceptance rate: 39% [Paper][Talk]
  6. Reynold Cheng.  View Updates and Temporal Correctness in Soft Real-Time Database Systems, MPhil Thesis, the University of Hong Kong, August 2000.  [Thesis][Defence]
  7. Ben Kao, K.Y. Lam, Brad Adelberg, Reynold Cheng, Tony Lee, Updates and View Maintenance in Soft Real-Time Database Systems, HKU CS technical report TR-99-06 (1999). [Report]

[Back to top]



Data Mining

I am investigating the issues of mining data streams. When association-rule mining is applied to a data stream system, the mining algorithm has to be redesigned so that it is adaptable to the constantly-arriving data. In particular, when data bursts exceed the processing and storage capability of the system, the accuracy of mining results should only be degraded gracefully. Hence there is a trade-off between the correctness of mining results, system capability, and timeliness of response. Given an error tolerance over the support threshold, preliminary results show that it is possible to develop an association-rule mining system that is adaptable to data stream arrival rates.

We are also exploring the issues of data mining in an uncertain environment. For example, in a moving-object environment, the locations of objects recorded by the database are inherently uncertain. Clustering algorithms running over the database values can produce incorrect results. We are currently developing clustering algorithms that can process uncertain data model that we proposed earlier
[PAKDD 2006].

We studied how association rules can be mined in a more efficient, when some itemsets are known. The solution is based on some results in lattice theory [PAKDD 1999].



Related Papers
  1. M. Chau, Reynold Cheng, B. Kao and J. Ng. Uncertain Data Mining: An Example in Clustering Location Data. In the Methodologies for Knowledge Discovery and Data Mining, Pacific-Asia Conference (PAKDD 2006), Singapore, April 2006. [Paper]
  2. M. Chau, R. Cheng and B. Kao. Uncertain Data Mining: A New Research Direction. Invited Paper, in the Workshop on the Sciences of The Artificial (WSA) 2005, National Dong Hwa University, Taiwan, Dec 2005.
  3. C.L. Yip, K.K. Loo, Ben Kao, David Cheung and Reynold Cheng. LGen - A Lattice-Based Candidate Set Generation Algorithm for I/O Efficient Association Rule Mining,  in the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining(PAKDD '99), pp. 54-63, Beijing, Apr 1999. Acceptance rate: 18.3% [Paper]
  4. C.L. Yip, K.K. Loo, Ben Kao, David Cheung, Reynold Cheng, LGen - A Lattice-Based Candidate Set Generation Algorithm for I/O Efficient Association Rule Mining, HKU CS technical report TR-99-01 (1999). [Report]

[Back to top]


Home | Publications | DBLP Bibliography | NEC CiteSeer| Google