Summary:
Data Management in Dynamic Environments
Due to advances in networking technologies and widespread use of
inexpensive sensors and location-sensing devices, it is now possible to
develop systems that enable users to closely interact with the external
world. In particular, retrieving data from external environments in a
correct, efficient and scalable manner becomes essential. This is a
hard problem, because stream data is often generated in a continuous
manner with high frequencies, and can easily exhaust computational and
communication resources (e.g., network bandwidth and sensor battery
power). Moreover, the data obtained, such as location and temperature
readings, are often inherently uncertain due to their continuously
changing characteristics and the limited precision of data-acquisition
hardware. Novel techniques such as uncertainty reasoning, uncertain
data indexing and communication protocols need to be developed.
Please click the following links to see my work in this area up to
date.
Querying
Uncertain Data
Sensor data is often obtained by
sampling. Due to their continuous-changing nature, sensor values are
inherently inaccurate. Ignoring
this aspect can lead to incorrect answers being computed. We
studied the issues of managing uncertain data in a
constantly-evolving environment.
We proposed data models to capture temporal and spatial uncertainty,
where a data item is represented as an interval with a probability
distribution function instead of a single value. These models can be
applied to a wide range of sensor-based applications, and can
be extended to multiple dimensions. With the notion of data
uncertainty, query answers are correct and imprecise. To express this
imprecision, we introduced probabilistic queries, for which answers are
augmented with probability values to indicate their confidence of
satisfying the queries. We further devised a classification scheme of
probabilistic queries, under which query evaluation algorithms were
developed.
An interesting aspect of probabilistic queries is that their answers can be
vague, and the vagueness is an indication of the quality of a probabilistic
answer. We proposed metrics for quantifying probabilistic answer quality of
different query classes. The idea of query-centric update policies was then
investigated, where sensors are selected for updating values based on their
effect on the quality of query answers. Extensive simulations showed that
these policies save network bandwidth.
Join is an important problem in a database. When data values are
uncertain, the notions of join operators need to be changed
accordingly. In a more recent work (under review), we have extended
various comparion operators to support uncertain values. We also
investigate how join queries can be evaluated over uncertain data
attributes.
Recently
we have been working on the prototype of a database system, called U-DBMS,
for managing constantly-evolving uncertain data. U-DBMS is an
enhancement of an open-source object-oriented database system
(PostgreSQL). It provides SQL-like query langauge for easy
specification of uncertain data as well as evaluation of probabilistic
queries. The prototype was demonstrated in [VLDB 2005].
Its detailed project description can be found
here.
Related Papers
- Reynold Cheng, Sarvjeet Singh and
Sunil Prabhakar. U-DBMS: A Database System for Managing
Constantly-Evolving Data. In Very Large Databases Conf. (VLDB
2005 Poster), Trondheim, Norway, Aug 2005. [Paper]
- Reynold
Cheng
and Sunil Prabhakar. Sensors, Uncertainty Models and
Probabilistic Queries. Accepted for publication in Encyclopedia of Database
Technologies and Applications, edited by L. Rivero, J. Doorn and
E. Ferraggine, Idea Group Publishing, 2005. [Chapter]
- Reynold
Cheng
and Sunil Prabhakar. Managing Uncertainty in Sensor Databases.
Special Section on Sensor Network Technology and Sensor
Data Management, SIGMOD Record,
Vol.32, No.4, pp. 41-46,
Dec
2003. [Paper]
- Reynold
Cheng,
Dmitri V. Kalashnikov and Sunil Prabhakar. Evaluating Probabilistic
Queries over Imprecise Data. In Proc. of the ACM Special
Interest Group on Management of Data (ACM
SIGMOD 2003), pp. 551-562, June 2003. Acceptance rate: 15.2%, 52/342. [Paper][Talk]
- Reynold
Cheng,
D. V. Kalashnikov, S. Prabhakar, Evaluating Probabilistic Queries
over Imprecise Data, Technical Report number 02-026, Purdue
University, November 2002. [Report]
- Reynold
Cheng, Dmitri V. Kalashnikov and Sunil Prabhakar. Evaluation of
Probabilistic
Queries over Imprecise Data in Constantly-Evolving
Environments. In Information
Systems (IS). [Paper]
- Reynold
Cheng. Managing Uncertainty in Constantly-Evolving
Environments, PhD
Thesis, Purdue University, May 2005. [Thesis][Defence]
-
Reynold Cheng, S. Singh, Sunil Prabhakar, R. Shah, J. Vitter and Y. Xia.
Efficient Join Processing over Uncertain Data. In the ACM 15th
Conf. on Information and Knowledge Management (ACM CIKM 2006), Arlington, USA
2006. Acceptance rate: 15%. [Paper]
[Back to top]
Querying
Imprecise Moving
Objects
A typical example of data streaming
applications with inherent uncertainty is a location-based service,
which monitors continuously-moving objects and provides services based
on their locations. We applied our data uncertainty models to such
environments, and developed probabilistic query evaluation techniques
for nearest-neighbor queries. These
query algorithms take into account the specific behaviors of moving
objects, e.g., whether they move on a well-defined path. Empirical
studies showed that they can handle answer incorrectness better than
traditional queries.
Related Papers
- Reynold
Cheng, Dmitri
V. Kalashnikov and Sunil Prabhakar. Querying Imprecise Data in
Moving Object Environments. In IEEE
Transactions on Knowledge and Data Engineering (IEEE TKDE), Vol. 16, No. 9,
pp. 1112-1127, Sep 2004.[Paper][Appendix]
- Reynold
Cheng, Sunil
Prabhakar and Dmitri V. Kalashnikov. Querying Imprecise
Data in Moving Object Environments. in
Proc. of the Intl. Conf. on Data Engineering (IEEE ICDE 2003), pp. 723-725,
Bangalore, India, Mar 2003. Acceptance
rate: 13.5%. [Paper][Talk]
- Reynold
Cheng,
Sunil Prabhakar and D. Kalashnikov.
Querying Imprecise Data in Moving
Object Environments, Technical Report number 02-020, Purdue
University, October 2002. [Report]
[Back to top]
Indexing Uncertain Data
Probabilistic queries are guaranteed to be correct and more informative
than traditional queries. However, they are more expensive
to evaluate. For many applications, although answer probabilities are
useful, it is often
only necessary to know whether the probability exceeds a given
threshold. We term this query
Probabilistic Threshold Query (PTQ). In VLDB 2004,
efficient solutions for computation of PTQ were studied. We established
that this problem is theoretically difficult to be solved
efficiently. However, there exist practical solutions. In particular,
we
proposed two index
structures and associated algorithms. The first index scheme is based
on the idea of augmenting R-trees with
uncertainty information. To overcome limitations of the first indexing
scheme, a technique
called variance-based clustering was developed, where data points with
similar degrees of
uncertainty are clustered together. This extensive index structure can
answer queries for various
kinds of uncertainty probability density functions with a
close-to-optimal performance.
Experiments showed that both indexing schemes outperform interval
R-trees. In [VLDB 2005],
we have extended
these indexes to support multi-dimensional uncertain data. The studies
over joins on uncertain data are published in CIKM 2006.
Related
Papers
- Yufei
Tao, Reynold
Cheng, Xiaokui Xiao, WangKay Ngai, Ben Kao and Sunil Prabhakar.
Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability
Density Functions. Accepted in Very Large Databases Conf. (VLDB
2005), Trondheim, Norway, Aug 2005. Acceptance rate: 16.5%, 53/322.
[Paper][Project]
- Reynold Cheng,
Yuni
Xia, Sunil Prabhakar, Rahul Shah and Jeffrey Scott Vitter. Efficient
Indexing Methods for Probabilistic Threshold Queries over Uncertain Data.
In Very Large Databases Conference (VLDB
2004), pp. 876-887,
Toronto, Canada, Sep 2004. Acceptance rate: 16%. [Paper][Talk]
-
Reynold Cheng, S. Singh, Sunil Prabhakar, R. Shah, J. Vitter and Y. Xia.
Efficient Join Processing over Uncertain Data. In the ACM 15th
Conf. on Information and Knowledge Management (ACM CIKM 2006), Arlington, USA
2006. Accepted rate: 15%. [Paper]
[Back to top]
Indexing Moving
Objects
Consider a system when millions of
moving objects are being queried. Due to the high update rate in this
environment, traditional
indexing methods perform poorly. We studied the problem of efficient
indexing for this kind of
environment, where we exploited the properties of moving objects and
designed better indexes for
efficient processing of updates and queries. In [ACM SAC 2005],
statistical information such as mean and variance was proposed as
clustering criteria. In [IEEE
ICDE 2005],
we observed that very often objects stay in the same region (called
quasi-static region) for an extensive
amount of time before leaving it for another one. Based on historical
information, an
algorithm for discovering these regions was devised. A moving object
index was constructed based on
quasi-static regions, which allows better performance compared with
indexes that are built using
only the current positions of moving objects.
Related
Papers
- Y. Xia, R. Cheng, S.
Prabhakar, S. Lei and R. Shah. Indexing Continuously Changing Data with
Mean-Variance Tree. In the Intl. Journal of High Performance
Computing and Networking (IJHPCN): A Special Issue on Recent Advances
in Collaborative Internet Computing, Inderscience Publishers.
- Yuni Xia, Sunil
Prabhakar, Shan Lei, Reynold Cheng
and Rahul Shah. Indexing
Continuously Changing Data with Mean-Variance Tree. Accepted for
publication in
the 20th Annual ACM Symposium on Applied Computing (ACM SAC 2005), Mar 2005. Acceptance
rate: 30% [Paper]
- Reynold
Cheng,
Yuni
Xia, Sunil Prabhakar and Rahul Shah. Change Tolerant Indexing
over Constantly Evolving Data. Accepted for publication in Intl.
Conf. on Data
Engineering (IEEE ICDE 2005),
Tokyo, Japan, Apr 2005. Acceptance
rate: 12.9%, 67/521. [Paper]
- Reynold
Cheng,
Yuni Xia, Sunil Prabhakar and Rahul Shah. Change Tolerant
Indexing over Constantly Evolving Data, Technical Report number
04-006, Purdue University, June 2004. [Report]
[Back to top]
Location Privacy
In
location-based services, a user's location has to be known before she
can be served. However, her location privacy can be lost if she does
not want anyone to know where she is. We wrote a
survey to investigate
the various aspects of location privacy, and the current solutions,
including legislation and technologies like privacy access control
policies and location anonymizers.
We also proposed the use of probabilistic queries for controlling the
trade-off between location privacy and quality of service [Mobile HCI 2004].
In this solution, location privacy is protected by providing the service
provider with location information in a lower spatial and temporal resolution.
In doing so, however, service quality can be degraded. We model the "blurred"
location information as uncertain data, and by using probabilistic querying
techniques a quantification of service quality can be obtained. In a more
recent work, we are using probabilistic queries to study the balance between
privacy and quality of service. A related problem is the prevention of
"inference attack" from such systems, based on the knowledge of history of the
object motion [PET 2006].
Related
Papers
-
Reynold Cheng,
Yu Zhang, Elisa Bertino and Sunil Prabhakar.
Preserving
User Location Privacy in Mobile Data Management Infrastructures. In the
Lecture Notes in Computer Science (LNCS), Privacy Enhancing Technology
Workshop (PET 2006), Cambridge, UK, June 2006. Acceptance
rate: 26%. [Paper][Talk]
- Reynold Cheng
and
Sunil Prabhakar. Using Uncertainty to Provide
Privacy-Preserving and High-Quality Location-Based Service. In the Mobile HCI 2004 workshop
on Location Systems Privacy and
Control, Glasgow, Scotland, September 2004. [Paper][Talk]
- Reynold Cheng. A
Survey on Location
Privacy. CS 626 Information Assurance class project, May 2003. [Report]
[Back to top]
Communication
in Streams and Sensor Networks
Probabilistic queries can be used to select appropriate sensors in a
sensor network that is composed of many low-cost but error-prone
sensors. In [RTCSA 2005,
ACM VSSN
2004] We studied the problem where it is assumed that many sensors
for monitoring the same entity (e.g., temperature of a room) are
available in a bandwidth-limited wireless network. With sufficient
number of sensors, the aggregate sensor reading obtained can have a
better confidence guarantee. However, due to limited network bandwidth,
one can only afford to choose a minimum number of sensors.
Probabilistic queries can help in this context, in which probabilistic
guarantees of answers are used to quantify the minimum number of
sensors to be used to attain a reliable reading. We studied the
sensor selection problem for a number of queries, including range
queries, range count queries and maximum/minimum queries.
There is a trade-off between frequencies of updates from streams and
accuracy of query answers. In [VLDB
2005], we
investigated efficient mechanisms for reducing communication cost,
while guaranteeing that the results are correct within a certain error
threshold. The queries that we studied are continuous "entity-based
queries", which return the names of objects as part of the answers. We
used both real and synthetic datasets to verify that my proposed
protocols can reduce the utilization of network bandwidth.
Related
Papers
- S. Han, E. Chan, R. Cheng and
K. Y. Lam. A Statistics-Based Sensor Selection Scheme for Continuous
Probabilistic Queries in Sensor Network. In the Real Time Systems
Journal (RTS).
- Reynold
Cheng, Ben Kao, Sunil Prabhakar, Alan Kwan and Yicheng Tu. Adaptive
Stream Filters for Entity-based Queries with Non-Value Tolerance.
Accepted in Very Large Databases Conf. (VLDB 2005), Trondheim,
Norway, Aug 2005. Acceptance rate: 16.5%, 53/322.
[Paper][Project]
- Song Han, Edward Chan, Reynold
Cheng and Kam-Yiu Lam. A Statistics-Based Sensor Selection Scheme
for Continuous Probabilistic Queries in Sensor Networks. Accepted in
the 11th IEEE Intl. Conf. on Embedded and Real-Time Computing Systems
and Applications (IEEE RTCSA 2005), Hong Kong, Aug 2005. [Paper][Project]
- Kam-Yiu
Lam, Reynold
Cheng, BiYu Liang and Jo Chau. Sensor Node Selection for
Execution of Continuous Probabilistic Queries in Wireless Sensor
Networks. In ACM 2nd International Workshop on
Video Surveillance and Sensor Networks (ACM
VSSN 2004) in conjunction with 12th ACM
International Conference on Multimedia, pp. 63-71, New York, Oct 2004. [Paper][Talk]
[Back to top]
Real-time
Databases
During my years at the University of Hong Kong, I studied a concurrency
control protocol for a real-time database system, where sensors report
their updates to the system continuously. While the system has to
allocate resources to refresh data upon new updates, it also has to
take care of the timing constraints on transactions. We studied a set
of
concurrency control protocols in [ACM Postgraduate Research
Day
1999; ACM CIKM 1999;
IEEE TC
2003] which not only guarantee transaction
correctness, but also consider the trade-off between data freshness and
timing requirements. We also performed an extensive simulation to
compare
various protocols. My MPhil. thesis in HKU was basd
on these works. In [IS 2002],
We developed concurrency control protocols for a database system that
support transactions with different real-time requirements. We also
wrote a book chapter
about the current state of art of disk management in real-time database
systems.
Related
Papers
- Ben
Kao, Kam-Yiu
Lam, Brad Adelberg, Reynold Cheng and Tony Lee. Maintaining
Temporal Consistency of Discrete Objects in Soft Real-Time Database
Systems, in IEEE Transactions on Computers (IEEE TC), Vol. 52, No. 3, pp.
373-389,
March 2003. [Paper]
- Kam-Yiu Lam,
Tei-Wei Kuo, Ben Kao, Tony S.H. Lee and Reynold Cheng, Evaluation
of Concurrency Control Strategies for Mixed Soft Real-Time Database
Systems, in Information Systems (IS),
Vol. 27, No. 2, pp. 123-149, Elsevier Science, 2002. [Paper]
- Ben
Kao and Reynold
Cheng. Disk I/O Scheduling. In Real-Time Database
Systems:
Architecture and Issues, edited by Kam-Yiu Lam and Tei-Wei Kuo,
Kluwer Academic Publishers, pp. 97-107, Boston, December 2001. [Chapter]
- Reynold
Cheng,
Updates and View Maintenance in Soft Real-Time Database Systems. In
the 2nd ACM Hong Kong Postgraduate Research Day (2nd ACM HK PG Day, Best
Paper Award), University of Hong Kong, October 23, 1999. [Paper]
- Ben
Kao,
K.Y. Lam, Brad
Adelberg, Reynold Cheng and Tony Lee.Updates and View
Maintenance in Soft Real-Time Database Systems, in the Eighth
International Conference on Information and Knowledge Management (ACM
CIKM '99), pp. 300-307, Kansas City, Missouri, USA, Nov., 1999. Acceptance rate: 39% [Paper][Talk]
- Reynold
Cheng. View Updates
and Temporal Correctness in Soft Real-Time Database Systems, MPhil
Thesis, the University of Hong Kong, August 2000. [Thesis][Defence]
- Ben
Kao, K.Y.
Lam, Brad Adelberg, Reynold Cheng, Tony Lee, Updates and
View Maintenance in Soft Real-Time Database Systems, HKU CS
technical report TR-99-06 (1999). [Report]
[Back to top]
Data Mining
I am investigating the issues of mining data streams. When
association-rule mining is applied to a data stream system, the mining
algorithm has to be redesigned so that it is adaptable to the
constantly-arriving data. In particular, when data bursts exceed the
processing and storage capability of the system, the accuracy of mining
results should only be degraded gracefully. Hence there is a trade-off
between the correctness of
mining results, system capability, and timeliness of response. Given an
error tolerance
over the support threshold, preliminary results show that it is
possible to develop an association-rule mining system that is
adaptable to data stream arrival rates.
We are also exploring the issues of data mining in an uncertain
environment. For example, in a moving-object environment, the locations
of objects recorded by the database are inherently uncertain.
Clustering algorithms running over the database values can produce
incorrect results. We are currently developing clustering algorithms
that
can process uncertain data model that we proposed earlier
[PAKDD
2006].
We
studied how association rules can be mined in a more efficient, when
some itemsets are known. The solution is based on some results in
lattice theory [PAKDD
1999].
Related
Papers
-
M. Chau, Reynold Cheng, B. Kao
and J. Ng. Uncertain Data Mining: An Example in Clustering
Location Data. In the
Methodologies for Knowledge Discovery and Data Mining, Pacific-Asia
Conference (PAKDD
2006), Singapore, April 2006.
[Paper]
-
M. Chau, R. Cheng and B.
Kao. Uncertain Data Mining: A New Research Direction. Invited
Paper, in the Workshop on the Sciences of The Artificial (WSA) 2005,
National Dong Hwa University, Taiwan, Dec 2005.
- C.L. Yip, K.K. Loo, Ben Kao, David
Cheung and Reynold
Cheng. LGen - A Lattice-Based Candidate
Set Generation Algorithm for I/O Efficient Association Rule Mining, in the Third Pacific-Asia
Conference on Knowledge Discovery and Data Mining(PAKDD '99), pp.
54-63, Beijing, Apr
1999. Acceptance rate: 18.3% [Paper]
- C.L.
Yip, K.K.
Loo, Ben Kao, David Cheung, Reynold Cheng, LGen - A
Lattice-Based Candidate Set Generation Algorithm for I/O Efficient
Association Rule Mining, HKU CS technical report TR-99-01 (1999). [Report]
[Back to top]