Clustering is the task of grouping a set of objects so that objects in the same group (cluster) are more similar to each other than to objects in other groups. Most clustering algorithms use quantitative or qualitative similarity measures to draw the boundaries between one group and another.
Today, a huge volume of data flows in and out of the internet every second, creating an ever-growing demand for dynamic machine learning and data mining algorithms. In particular, dynamic clustering algorithms are essential for updating clusters in real time under arbitrary insertions and deletions of data points.
In many real-world problems, a substantial part of the data can be added or removed over time. Fully Dynamic k-Center Clustering fits this setting well because it supports arbitrary insertions and deletions: it provides a (2+ε)-approximation algorithm for the k-center clustering problem with “small” amortized cost under the fully dynamic adversarial model.
An inserted data point is compared against the pre-existing clusters. If the point falls within the radius of one of the k centers, it joins that center’s cluster; otherwise, it is classified as an unclustered point.
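As a concrete illustration, the following minimal Python sketch implements this insertion rule. The data structures (a center list, a center-to-members map, an unclustered set) and the fixed `radius` are assumptions made for the example, not the paper’s actual implementation.

```python
import math

def insert_point(point, centers, clusters, unclustered, radius):
    """Attach `point` to the first center whose ball of radius `radius`
    covers it; otherwise mark the point as unclustered."""
    for c in centers:
        if math.dist(point, c) <= radius:
            clusters[c].add(point)
            return
    unclustered.add(point)
```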
Deletion is handled in two cases. If the deleted point is not one of the k centers, it can be removed in O(1) time.
If the point is one of the k centers, its cluster members lose their center and become unclustered, together with the members of every cluster formed after the deleted center’s cluster.
These points are then re-clustered by selecting new centers uniformly at random.
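The deletion step can be sketched in the same illustrative style. The `k` cap, the data structures, and the linear membership scan below are assumptions of this sketch, not the paper’s implementation (which removes non-centers in O(1) time, e.g. via a point-to-center index).

```python
import math
import random

def delete_point(point, k, centers, clusters, unclustered, radius):
    """Illustrative deletion: a non-center is simply dropped; deleting the
    i-th center dissolves its cluster and every cluster formed after it,
    then re-clusters the freed points around uniformly random new centers."""
    if point not in centers:
        # A point-to-center index would make this O(1); a linear scan is
        # used here only to keep the sketch short.
        unclustered.discard(point)
        for members in clusters.values():
            members.discard(point)
        return
    i = centers.index(point)
    freed = set(unclustered)
    unclustered.clear()
    for c in centers[i:]:                 # dissolve cluster i and all later ones
        freed |= clusters.pop(c)
    freed.discard(point)
    del centers[i:]
    while freed and len(centers) < k:     # pick uniformly random new centers
        c = random.choice(tuple(freed))
        centers.append(c)
        ball = {p for p in freed if math.dist(p, c) <= radius}
        clusters[c] = ball
        freed -= ball
    unclustered |= freed                  # leftovers stay unclustered
```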
Although Python is used as the host language for the project, most of the core functions are already available in C. Components that can be imported via wrapper layers, such as Cython or PyObject, were integrated into our system. To verify the integration, portions of the datasets used in Fully Dynamic k-Center Clustering (Twitter, Flickr, and Trajectories) were used to reproduce the experiment conducted in the paper.
Because fully dynamic k-center clustering inserts and deletes data points dynamically, the clusters are frequently updated and restructured by these operations. Clusters maintained by dynamic algorithms therefore redefine their “key” points (the cluster centers) more often than those produced by static algorithms. Reassignment of these key points leads to reconstruction of the clusters, so the new clusters contain different data points.
Consequently, the new clusters are unstable: they are formed from different members and lose the previous structure. We therefore define stability as how similar the new clusters are to the previous clusters (the clusters before re-clustering), where the similarity of clusterings can be measured with different metrics.
A recent research paper (Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance) thoroughly discusses methods for measuring similarity between clusterings. Pair-counting, set-matching, and information-theoretic measures are the classes we considered; specifically, the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) were implemented and tested on our system. Both measures use pair counting or a contingency table as the medium of comparison between two clusterings.
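Both measures are available off the shelf. For instance, assuming scikit-learn is installed, they can be computed from the label assignments before and after a re-clustering step (the label arrays below are made-up illustrative data, not our experimental results):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Cluster labels for the same seven points before and after re-clustering
# (illustrative values only).
labels_before = [0, 0, 1, 1, 2, 2, 2]
labels_after  = [0, 0, 1, 2, 2, 2, 1]

print("ARI:", adjusted_rand_score(labels_before, labels_after))
print("NMI:", normalized_mutual_info_score(labels_before, labels_after))
```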
Consistency is defined in terms of the centers. In (Consistent K-Clustering), the number of center changes is used as a consistency measure for clusterings. Hence, if the centers can be kept unchanged, the clustering environment is more stable. The core algorithms in this project were tested with this consistency measure, and the lower bound for consistency was also studied.
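One plausible reading of this measure, counting how many centers differ between two consecutive clusterings, can be sketched as follows (the function name and representation are illustrative, not taken from the cited paper):

```python
def consistency_cost(old_centers, new_centers):
    """Number of center changes between consecutive clusterings, measured
    as the count of old centers that are no longer centers."""
    return len(set(old_centers) - set(new_centers))

# Example: one of the two original centers was replaced.
print(consistency_cost([(0, 0), (1, 1)], [(0, 0), (2, 2)]))  # -> 1
```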
The original fully dynamic k-center clustering can produce unstable clusters whenever re-clustering follows an arbitrary deletion. As the sample figure on the left shows, the algorithm loses a center at (2) and then selects new centers uniformly at random (3). The result (4) is a set of clusters that share little with the original clustering and can therefore be regarded as unstable.
The modified algorithm improves similarity and consistency, as proposed in the previous phases. As the diagram shows, when a deletion occurs, the algorithm unclusters only the data points that are close enough to the deleted center. The centers that are far from the deleted center are preserved, and their indices are pushed to the front. Afterwards, the unclustered data points are re-inserted through the same arbitrary insertion process, choosing new centers uniformly at random among those unclustered points; a sketch of this modified deletion is given below.
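The sketch below follows the same assumed data structures as the earlier ones; the `near` threshold that decides which clusters are “close enough” to dissolve is a hypothetical parameter, since the text does not specify it.

```python
import math
import random

def delete_center_modified(center, k, centers, clusters,
                           unclustered, radius, near):
    """Modified deletion sketch: only clusters whose centers lie within
    `near` of the deleted center are dissolved; all other centers keep
    their clusters and remain at the front of the center list."""
    freed = clusters.pop(center) - {center}
    dissolved = [c for c in centers
                 if c != center and math.dist(c, center) <= near]
    for c in dissolved:
        freed |= clusters.pop(c)
    # Surviving (far-away) centers keep their clusters; since new centers
    # are appended after them, their indices end up at the front.
    centers[:] = [c for c in centers if c != center and c not in dissolved]
    freed |= unclustered
    unclustered.clear()
    while freed and len(centers) < k:     # same random re-insertion as before
        c = random.choice(tuple(freed))
        centers.append(c)
        ball = {p for p in freed if math.dist(p, c) <= radius}
        clusters[c] = ball
        freed -= ball
    unclustered |= freed
```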
|             | FDKCC | SR    | FDKCC | SR    |
|-------------|-------|-------|-------|-------|
| Consistency | 8.082 | 1.268 | 8.023 | 1.232 |
| ARI         | 0.972 | 0.998 | 0.969 | 0.998 |
| NMI         | 0.983 | 0.999 | 0.981 | 0.999 |