2. Sorting, Searching and Selection

We begin our study of algorithms by learning some well-known algorithms for three important problems that we encounter constantly. We also present some properties of these problems that are useful when designing new algorithms for them.

 


2.1 Sorting

Problem 2.1. Sorting an array of n items
Input: An array A[1...n] of items with a total order
Output: The items of A arranged in non-decreasing order

Most sorting algorithms are comparison sorts: a comparison sort reads the array items only through a single abstract comparison operation that determines which of two items should occur first in the final sorted order, and it rearranges the items by swapping those that are out of order. Here we will introduce three basic comparison sorting algorithms: Selection Sort, Bubble Sort and Insertion Sort.


Selection Sort

There are many algorithms for sorting an array; selection sort is among the simplest and most intuitive. The idea of selection sort is to first find the smallest item in A[1...n] and swap it with the 1st item, then find the smallest item in the range A[2...n] and swap it with the 2nd item, and so on, until the remaining range is A[n-1...n]. At that point all items have been sorted.


selectionSort( Array A[1..n] )
 for( int i = 1 to n-1 )
     // find the index of the smallest item in A[i..n]
     int indexOfMin = i;
     for( int j = i+1 to n )
         if( A[indexOfMin] > A[j] )
             indexOfMin = j;
     swap( A[i], A[indexOfMin] );
 return A;

Selection sort makes (n-1) + (n-2) + ... + 1 = n(n-1)/2 comparisons, so its time complexity is O(n²).

 

Insertion Sort

Insertion sort is another simple comparison sorting algorithm. Its implementation is very simple, and it is quite efficient for small data sets; therefore, some advanced hybrid sorting algorithms switch to insertion sort once the data set becomes small enough. Insertion sort is also well suited to online use, because it can efficiently keep the array sorted as each new element arrives.

insertionSort( Array A[1..n] )
 for( int i = 2 to n )
     // insert A[i] into the already-sorted prefix A[1..i-1]
     temp = A[i];
     int j = i - 1;
     while( j >= 1 and A[j] > temp )
         A[j+1] = A[j];    // shift larger items one place to the right
         j = j - 1;
     A[j+1] = temp;
 return A;

Although the pseudocode above implements insertion sort with an array, it can be better to use a linked list instead, because a linked list avoids some unnecessary movements of the elements. The differences between arrays and linked lists are discussed in the topic on "Data Structures". The time complexity of insertion sort is O(n²).
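To make the linked-list remark concrete, here is a minimal Python sketch (the Node class and names are ours, not part of these notes): inserting a node only relinks two pointers instead of shifting elements, although finding the insertion point still takes linear time.

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def insertion_sort_list(head):
    """Insertion sort on a singly linked list; returns the new head."""
    sorted_head = None                    # head of the sorted portion
    while head is not None:
        node, head = head, head.next      # detach the next unsorted node
        if sorted_head is None or node.value < sorted_head.value:
            node.next = sorted_head       # new smallest: becomes the head
            sorted_head = node
        else:
            cur = sorted_head
            while cur.next is not None and cur.next.value <= node.value:
                cur = cur.next            # find the insertion point
            node.next = cur.next          # splice in: two pointer updates,
            cur.next = node               # no element is shifted
    return sorted_head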

 

Bubble Sort

The third sorting algorithm we are going to introduce is Bubble Sort. It works by repeatedly stepping through the items to be sorted, comparing each pair of adjacent items and swapping them if they are in the wrong order. These passes over the items are repeated until the whole array is sorted.

bubbleSort( Array A[1..n] )
 for( int i = n to 2 )
     // after each pass, the largest item in A[1..i] has bubbled to A[i]
     for( int j = 1 to i-1 )
         if( A[j] > A[j+1] )
             swap( A[j], A[j+1] );
 return A;

For n items, bubble sort requires O(n²) comparisons. Since other O(n²) sorting algorithms such as insertion sort perform better, bubble sort is not a practical choice for large n. Although bubble sort and insertion sort have the same worst-case running time, insertion sort performs better even on random inputs. Because of its simplicity, however, bubble sort is often used to introduce the concept of an algorithm to students who are first learning to program.

 


2.2 Can we sort faster? (Part 1)

Selection Sort, Insertion Sort and Bubble Sort all require Ω(n²) time in the worst case. To design faster sorting algorithms, it is useful to observe a relationship between sorting and the number of inversions in the input.

Definition 2.1 For an array A[1...n], an inversion is a pair (i, j) where i < j and A[i] > A[j].

Swapping two adjacent items reduces the number of inversions by at most one. There can be up to n(n-1)/2 inversions in an array, so in the worst case n(n-1)/2 adjacent swaps are needed.
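To make Definition 2.1 and the swap bound concrete, here is a small Python check (our own illustration, not from the notes): a brute-force inversion count matches the number of adjacent swaps bubble sort performs, because each such swap removes exactly one inversion.

def count_inversions(a):
    """Number of pairs (i, j) with i < j and a[i] > a[j], in O(n²) time."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(i + 1, n) if a[i] > a[j])

def bubble_sort_swaps(a):
    """Bubble-sort a copy of a; return how many adjacent swaps were made."""
    a, swaps = list(a), 0
    for i in range(len(a) - 1, 0, -1):
        for j in range(i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
                swaps += 1
    return swaps

data = [3, 1, 4, 1, 5, 9, 2, 6]
assert count_inversions(data) == bubble_sort_swaps(data)   # both are 8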

Theorem 2.1 Any sorting algorithm that changes the input only by swapping adjacent items takes Ω(n²) time in the worst case.

Bubble Sort and Insertion Sort change the array only by adjacent swaps (an insertion-sort shift is a sequence of adjacent swaps), so Theorem 2.1 explains their Ω(n²) running time. Hence, to design faster algorithms, we must move items that are further apart, which is exactly what Merge Sort and Quick Sort do.

 

Merge Sort

Merge Sort is an O(n log n) comparison-based sorting algorithm. The key ideas behind Merge Sort are, first, that a shorter array takes fewer steps to sort than a longer one (sorting two arrays of length n costs about 2(n log n) comparisons, whereas sorting one array of length 2n costs about (2n)log(2n)), and second, that constructing a sorted array from two sorted arrays takes less time than from two unsorted arrays.

MergeSort( Array Input[1...n] )
   If n <= 1 then
       Return Input[1...n]
   Array leftSubArray = MergeSort( Input[1..n/2] )
   Array rightSubArray = MergeSort( Input[n/2+1..n] )
   Return Merge( leftSubArray, rightSubArray )

Merge( left[1...m], right[1...n] )
   Array result[1...m + n]
   int leftPosition = 1, rightPosition = 1, resultPosition = 1;
   While leftPosition <= m or rightPosition <= n do
       If leftPosition <= m and rightPosition <= n then
           If left[leftPosition] <= right[rightPosition] then
               result[resultPosition] = left[leftPosition];
               leftPosition++;
           Else
               result[resultPosition] = right[rightPosition];
               rightPosition++;
       Else if leftPosition <= m then
           result[resultPosition] = left[leftPosition];
           leftPosition++;
       Else
           result[resultPosition] = right[rightPosition];
           rightPosition++;
       resultPosition++;    // advance the output position once per iteration
   end while
   Return result;

Merge sort works as follows:

  1. If the array's length is 0 or 1, then it is already sorted. Otherwise,
  2. Divide the array into two sub-arrays of about half the size
  3. Sort each sub-array recursively
  4. Merge the two sorted sub-arrays into one sorted array

The efficiency of merge sort rests on two observations:

  1. A small array takes fewer steps to sort than a large array.
  2. Fewer steps are required to construct a sorted array from two sorted arrays than from two unsorted arrays.

Assuming merging takes O(n), the running time satisfies the recurrence T(n) = 2T(n/2) + O(n); by the master theorem[1] the time complexity is O(n log n).
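The master theorem result can also be checked by unrolling the recurrence by hand; in LaTeX notation (a standard derivation added here, with c the constant hidden in the O(n) merge cost):

\begin{align*}
T(n) &= 2\,T(n/2) + cn \\
     &= 4\,T(n/4) + 2cn \\
     &= \cdots = 2^{k}\,T(n/2^{k}) + k\,cn .
\end{align*}

After k = \log_2 n levels the subproblems reach size 1, giving T(n) = n\,T(1) + cn\log_2 n = O(n \log n).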

 

Quick Sort

Quick Sort is a practically fast sorting algorithm, and it is the sorting algorithm implemented by most libraries. Quick Sort is a divide-and-conquer algorithm: it first divides a large array into two smaller sub-arrays, the low elements and the high elements, and then recursively sorts the sub-arrays.

quicksort( A[1...n] )
   create Array Less, Equal, Greater
   if A.length <= 1 then
       return A    // an array of zero or one elements is already sorted
   Randomly select a pivot value P from A
   for each item I in A
       if I < P then append I to Less
       else if I = P then append I to Equal
       else append I to Greater
   return concatenate( quicksort(Less), Equal, quicksort(Greater) )

When Quick Sort runs, it randomly picks a pivot from the array. The pivot is the item the algorithm selects first and around which it partitions the remaining items.

Reorder the array so that all items with values less than the pivot come before the pivot, while all items with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. Recursively sort the sub-array of lesser elements and the sub-array of greater elements. The base cases of the recursion are arrays of size zero or one, which never need to be sorted.
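The pseudocode above builds new arrays for the two sides; practical implementations usually partition in place instead. Below is a Python sketch of one common scheme (the Lomuto partition with a random pivot; this variant is our addition, not the exact method of the notes).

import random

def quicksort(a, lo=0, hi=None):
    """In-place quicksort using the Lomuto partition scheme."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return                            # size 0 or 1: already sorted
    p = random.randint(lo, hi)            # random pivot, moved to the end
    a[p], a[hi] = a[hi], a[p]
    pivot, store = a[hi], lo
    for i in range(lo, hi):
        if a[i] < pivot:                  # smaller items go to the front
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi] = a[hi], a[store]     # pivot lands in its final position
    quicksort(a, lo, store - 1)
    quicksort(a, store + 1, hi)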

With a random pivot the two sides have roughly equal sizes on average, so, as with Merge Sort, the expected running time satisfies T(n) = 2T(n/2) + O(n), and the master theorem gives T(n) = O(n log n). (In the worst case, when the pivot is always an extreme element, Quick Sort takes O(n²) time.)

 


2.3 Can we sort faster? (Part 2)

Theorem 2.2: Any comparison-based sorting algorithm takes Ω(n log n) time in the worst case.

The model that we will use for this proof is a decision tree. The root of the decision tree corresponds to the state of the input of the sorting problem. Each internal node of the decision tree represents such a state, as well as two possible decisions that can be made as a result of examining that state, leading to two new states. The leaf nodes are the final states of the algorithm, which in this case correspond to states where the input array is determined to be sorted. The worst-case running time of an algorithm modelled by a decision tree is the height or depth of that tree.

A sorting algorithm can be thought of as generating some permutation of its input. Since the input can be in any order, every permutation is a possible output. In order for a sorting algorithm to be correct in the general case, it must be possible for that algorithm to generate every possible output. Therefore, in the decision tree representing such an algorithm, there must be one leaf for every one of n! permutations of n input values.

Since each comparison results in one of two responses, the decision tree is a binary tree. A binary tree with n! leaves must have a minimum depth of log2(n!). But log2(n!) has a lower bound of Ω(n log n) [2]. Thus any general comparison-based sorting algorithm has the same lower bound.
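The bound cited from [2] can be seen directly (a one-line derivation added here): the largest n/2 factors of n! are each at least n/2, so

\log_2(n!) \;\ge\; \log_2\!\left( (n/2)^{n/2} \right) \;=\; \frac{n}{2}\log_2\frac{n}{2} \;=\; \Omega(n \log n).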

All the algorithms presented previously only compare the relative magnitude of two items, and we have just shown that any such comparison-based algorithm takes Ω(n log n) time in the worst case. To sort much faster, we now introduce a non-comparison-based algorithm.

 

Counting Sort

Counting sort is an integer sorting algorithm for sorting a collection of objects according to keys that are small integers. It counts the number of objects that have each distinct key value, and then uses arithmetic on those counts to determine the position of each key value in the output sequence.

Because counting sort uses key values as indexes into an array, it is not a comparison sort, and the Ω(n log n) lower bound for comparison sorting is inapplicable to it.

input : Array Input[1...n]
output : Array Output[1...n]

Create array Count[0..k], where k is the maximum value in Input, and initialize every cell to zero
for i = 1 to n
    Count[Input[i]]++                    // count occurrences of each key
total = 1                                // first output position (arrays are 1-based)
for i = 0 to k
    c = Count[i]
    Count[i] = total                     // Count[i] = first output slot for key i
    total = total + c
Create the output array Output[1..n]
for i = 1 to n                           // left-to-right scan keeps the sort stable
    Output[Count[Input[i]]] = Input[i]
    Count[Input[i]]++
return Output

The time complexity of counting sort is O(n + k), where n is the number of items and k is the maximum key value among the items.
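For reference, here is a direct Python transcription of the pseudocode above (a sketch adjusted to Python's 0-based lists, assuming the keys are integers in the range 0..k):

def counting_sort(inp):
    """Stable counting sort for non-negative integer keys."""
    k = max(inp)
    count = [0] * (k + 1)
    for x in inp:                  # count occurrences of each key
        count[x] += 1
    total = 0
    for i in range(k + 1):         # prefix sums: first output slot per key
        count[i], total = total, total + count[i]
    output = [0] * len(inp)
    for x in inp:                  # left-to-right scan keeps the sort stable
        output[count[x]] = x
        count[x] += 1
    return output

print(counting_sort([4, 1, 3, 4, 3]))    # [1, 3, 3, 4, 4]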

 


2.4 Searching

Problem 2.2. Searching a target item in an array of n items
Input: An array A[1...n] of items in sorted order
Output: The position of the target item in A

Searching means finding the position of a target item in an array.

Linear search.

Linear search, or sequential search, is a method for finding a particular value in an array: it checks each of the items, one at a time and in sequence, until the desired one is found. The time complexity of linear search is O(n). Linear search is the simplest searching algorithm, but it is not efficient when many searches must be answered on the same array.
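Since the notes give no pseudocode for linear search, here is a minimal Python sketch (our own; the returned position is 1-based to match the notes' A[1...n] convention):

def linear_search(a, target):
    """Return the 1-based position of the first match, or None if absent."""
    for i, x in enumerate(a):
        if x == target:
            return i + 1
    return None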

Binary Search.

Binary Search differs from the general Linear Search in that it works only on a sorted array; for an unsorted array there is no choice but to scan all elements, so Linear Search is already optimal there. The idea of Binary Search is that, in a sorted array, if the current element is not the target, then the target must lie either on the left-hand side or on the right-hand side, but not both, depending on whether the target is less than or greater than the current element. Since each comparison halves the remaining range, Binary Search takes O(log n) time.

BinarySearch( A[1..n], target, lower, upper )    // initial call: BinarySearch( A, target, 1, n )
   if lower > upper then
       return "Not Found"
   var mid = (lower + upper) / 2    // integer division
   if A[mid] == target then
       return mid
   else if A[mid] > target then
       return BinarySearch( A, target, lower, mid-1 )
   else
       return BinarySearch( A, target, mid+1, upper )

 


2.5 Quick Select

Problem 2.3. Selecting the median of the items in an array
Input: An array A[1...n] of items in a total order
Output: The median of the items

Quick Select is an algorithm for finding the k-th smallest number in a list.

In statistics, a median of a set of n numbers is the (n/2)-th smallest element of the set; that is to say, if the numbers in the set are sorted, the median is the element that separates the lower half from the higher half, so the median can be found by arranging the numbers from lowest to highest and picking the middle one. Note that if the set has an even number of elements, the median is usually defined to be the average of the two middle values.

As mentioned before, sorting can be used to find the median; however, doing so performs unnecessary work, because we only need the median, not the whole sorted list. Sorting n numbers takes O(n log n) operations, whereas a selection algorithm can compute the k-th smallest of n numbers in only O(n) operations.

The linear-time algorithm for selecting the k-th smallest number was published by Blum, Floyd, Pratt, Rivest and Tarjan in their 1973 paper "Time bounds for selection", so the linear general selection algorithm is also called the "BFPRT algorithm".

The idea of the selection algorithm, taking the median as an example and building on quickselect, is as follows: divide the n elements into groups of several elements (usually groups of five). Each group of five elements can be sorted very quickly, so the median of each group is found directly; then apply the selection algorithm recursively to this sublist of n/5 group medians to find their median, which becomes the pivot. Since the "median of medians" is chosen as the pivot in the recursive step, the selection algorithm is also called the "median of medians algorithm".

Algorithm

Assume S is a set of n elements; this SELECT algorithm returns the k-th smallest element of S
  1. Divide the n elements into n/5 groups of 5 elements
  2. Sort each group of 5 elements by Insertion Sort and find the median of each group
  3. Apply this SELECT algorithm recursively to find x, the median of the n/5 group medians.
  4. Using x as the pivot, separate S into A, the subset of elements < x; B, the subset of elements = x; and C, the subset of elements > x.
  5. If |A| < k <= |A| + |B|, then return x; else if k <= |A|, call SELECT(A, k) recursively to find the k-th smallest in A; otherwise (k > |A| + |B|), call SELECT(C, k - |A| - |B|) recursively to find the (k - |A| - |B|)-th smallest in C.

 

[Figure: each column is a sorted group of 5 elements; the elements drawn in pink are the medians of their groups.]
Note that at least half of the n/5 group medians are >= x, and each such median comes with two elements at least as large from its own group, so at least about 3n/10 >= n/4 of the numbers are >= x; by symmetry at least n/4 of the numbers are <= x. Therefore at most 3n/4 of the numbers are < x and at most 3n/4 are > x, i.e. |A| <= 3n/4 and |C| <= 3n/4.

In the SELECT algorithm, steps 1 and 4 clearly take O(n); for step 2, sorting a group of only 5 elements takes a constant number of comparisons, so all n/5 groups together take O(n). Steps 3 and 5 apply SELECT recursively, with the problem size reduced to n/5 and at most 3n/4 respectively.
By this recurrence, the time complexity of the selection algorithm is:
T(n) <= T(n/5) + T(3n/4) + O(n), which solves to T(n) = O(n) since the subproblem sizes sum to (1/5 + 3/4)n < n.
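Putting the five steps together, here is a compact Python sketch of the SELECT algorithm described above (the names and grouping details are ours; an illustration rather than a tuned implementation):

def select(s, k):
    """Return the k-th smallest element of s (k is 1-based)."""
    if len(s) <= 5:
        return sorted(s)[k - 1]           # base case: sort directly
    # Steps 1-2: medians of the groups of 5 (each small sort is O(1)).
    groups = [s[i:i + 5] for i in range(0, len(s), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    # Step 3: recurse to find x, the median of the medians.
    x = select(medians, (len(medians) + 1) // 2)
    # Step 4: partition S around the pivot x.
    a = [v for v in s if v < x]
    b = [v for v in s if v == x]
    c = [v for v in s if v > x]
    # Step 5: recurse into the part that contains the k-th smallest.
    if k <= len(a):
        return select(a, k)
    if k <= len(a) + len(b):
        return x
    return select(c, k - len(a) - len(b))

def median(s):
    return select(s, (len(s) + 1) // 2)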

 


2.6 Exercises

Sorting
1002 487-3279
1007 DNA Sorting
1245 Programmer, Rank Thyself
1520 Scramble Sort
2092 Grandpa is Famous
2159 Ancient Cipher
2299 Ultra-QuickSort
Searching
2456 Aggressive cows

 

 

[1] The master theorem provides a cookbook solution, in asymptotic terms (using Big O notation), for recurrence relations of the types that occur in the analysis of many divide-and-conquer algorithms.

[2] Asymptotic bounds for the factorial function.

© The University of Hong Kong Algorithms Library - hkual@cs.hku.hk