4.3 Huffman coding

Encoding is a common step in data compression and communication. Given a set of symbols, an encoding algorithm decides the mapping from each symbol to a corresponding bit codeword.
The bit codewords can be fixed-length or variable-length (e.g. ASCII characters are typically stored as fixed-length 8-bit codewords). If the occurrence frequencies of the symbols are not uniformly distributed, a variable-length encoding can be more efficient in the sense that it uses fewer bits to encode a document of symbols. However, if not well designed, a variable-length code may be ambiguous to decode. For example, let character ‘A' be encoded as 0 and character ‘B' as 00; then the coded text ‘000' has three interpretations: ‘AB', ‘BA' or ‘AAA'. Fortunately this ambiguity can be avoided by insisting on the prefix-free property, i.e. no codeword is a prefix of any other. Such codes are also called prefix codes.
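The unique decodability of a prefix code can be seen concretely: a decoder scans the bit stream left to right and emits a symbol as soon as the buffered bits match a codeword. A minimal sketch, using a small hypothetical code table (not derived from any particular Huffman tree):

```python
# Sketch: decoding a prefix-free code by scanning bits left to right.
# The code table is a hypothetical example: 'A' -> 0, 'B' -> 10, 'C' -> 11.
def decode(bits, code_to_symbol):
    """Emit a symbol as soon as the buffer matches a codeword.
    This is unambiguous only because the code is prefix-free."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in code_to_symbol:        # a complete codeword was read
            out.append(code_to_symbol[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

prefix_free = {"0": "A", "10": "B", "11": "C"}
print(decode("01011", prefix_free))  # -> ABC
```

With the non-prefix-free code of the example above ('A' = 0, 'B' = 00), this greedy scan would be forced to guess, which is exactly the ambiguity the prefix-free property rules out.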
Let w_1, w_2, …, w_n be the weights (frequencies) of the n given symbols, and let c_1, c_2, …, c_n be their encoded code strings. The weighted path length is defined as L = Σ_{i=1..n} w_i · |c_i|, where |c_i| denotes the length of c_i in bits. Note that when the weights are symbol probabilities, the weighted path length is the expected number of bits used to encode a symbol. Our goal is to find an optimal prefix-free encoding, i.e. one that minimizes the weighted path length.
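The weighted path length itself is a one-line computation. A minimal sketch with hypothetical weights and codewords (the code {0, 10, 11} is prefix-free):

```python
# Sketch: weighted path length L = sum over symbols of w_i * |c_i|.
# The weights and codewords below are hypothetical illustration values.
weights = {"A": 5, "B": 2, "C": 1}           # w_i: symbol frequencies
codes = {"A": "0", "B": "10", "C": "11"}     # c_i: prefix-free codewords

L = sum(weights[s] * len(codes[s]) for s in weights)
print(L)  # 5*1 + 2*2 + 1*2 = 11
```

A fixed-length code for three symbols would need 2 bits each, costing (5+2+1)·2 = 16, so the variable-length code above is already cheaper.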

Huffman tree
Huffman coding is a greedy algorithm for finding optimal prefix-free encoding. It constructs a binary tree (Huffman tree) with a bottom-up approach. Each tree leaf corresponds to a symbol and the path from tree root down to the leaf corresponds to the binary code string.

The Huffman tree is a full binary tree, i.e. every internal node has exactly two children. It is not difficult to see that the encoding represented by a full binary tree is prefix-free, since every symbol sits at a leaf and no leaf lies on the path to another (moreover, it can also be proved that any prefix-free encoding can be represented by a full binary tree).


To find the encoding tree with minimum weighted path length, one would expect low-weight symbols to have longer tree paths and high-weight symbols to have shorter tree paths. Suppose the encoding tree is built from the leaf nodes in a bottom-up manner. Then a greedy strategy should consider the low-weight leaves first, since leaves at the tree's deepest level have the longest path length. Huffman coding uses exactly this strategy; the algorithm is presented as follows.


  1. Initialize set S, which contains n leaf nodes with weights w_1, …, w_n.
  2. Loop for n-1 times:
    Select the two nodes x, y in S with the least weights, delete x, y from S.
    Create a new node w with x, y as its two children and the sum of x and y's weights as its weight.
    Add w into S.
  3. Return the only node remaining in S; it is the root of the Huffman tree.
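The steps above can be sketched in Python with a binary heap playing the role of S; the (weight, id, left, right) tuple layout is an assumption made for this illustration:

```python
import heapq

# Sketch of steps 1-3 above. S is a binary heap; each node is a tuple
# (weight, id, left, right) -- a layout assumed for this illustration.
def huffman_tree(weights):
    heap = [(w, i, None, None) for i, w in enumerate(weights)]
    heapq.heapify(heap)                      # step 1: initialize S
    next_id = len(weights)                   # fresh ids for internal nodes
    for _ in range(len(weights) - 1):        # step 2: loop n-1 times
        x = heapq.heappop(heap)              # the two least-weight nodes
        y = heapq.heappop(heap)
        z = (x[0] + y[0], next_id, x, y)     # merged node, summed weight
        next_id += 1
        heapq.heappush(heap, z)
    return heap[0]                           # step 3: the only node left

root = huffman_tree([5, 2, 1, 1])
print(root[0])  # root weight is the total weight -> 9
```

The unique `id` in each tuple breaks weight ties, so the heap never has to compare the child subtrees themselves.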

We will perform induction on the size of the encoding tree to show the optimality of Huffman coding. The critical properties of an optimal tree's substructures are summarized in the following lemmas.

Lemma 1: Let x and y be two symbols having the lowest weights. Then there exists an optimal encoding tree where x and y are siblings and they are at the tree's deepest level.

Proof: First note that an optimal encoding tree must be a full binary tree (otherwise we could ‘contract' an internal node with only one child, which reduces the weighted path length). So there are at least two leaves at the tree's deepest level. Suppose some symbol z other than x (or y) is at the deepest level of an optimal tree while x (or y) is not. Since w(x) ≤ w(z) and x's leaf is no deeper than z's, swapping the positions of z and x does not increase the weighted path length, so the resulting tree is still optimal. Applying the same argument to y, we obtain an optimal tree in which both x and y are at the deepest level. Finally, swapping two leaves at the same (deepest) level does not change the weighted path length, so we can further rearrange the deepest level to make x and y siblings. (end of proof)

Lemma 2: Let x and y be two symbols having the lowest weights. We remove x, y from the symbol set and add a new symbol z, whose weight is set as the sum of x and y's weights. Denote the original symbol set as S and the new one as S', so that S' = (S − {x, y}) ∪ {z}. Let T' be an optimal encoding tree for S'. We construct tree T from T' by replacing the leaf node for z with a new internal node having children x, y. Then T is an optimal encoding tree for the original symbol set S.

Proof: Let L' be the weighted path length of T' and L be that of T. It is not difficult to see that L = L' + w(x) + w(y), since x and y are one level deeper than z's leaf was, and z itself no longer contributes. To form a contradiction, suppose that some encoding tree T* for S has weighted path length L* < L. From Lemma 1 we may assume that x, y are siblings at the deepest level of T*. So we can construct a new tree T*' by replacing the parent node of x, y (together with x, y) with a new leaf node for z; then T*' is an encoding tree for S'. Observe that T*' has weighted path length L* − w(x) − w(y) < L − w(x) − w(y) = L', which is less than L'. This contradicts the precondition that T' is an optimal encoding tree for S'. Thus L is the smallest possible weighted path length for encoding S, which means T is an optimal encoding tree for S. (end of proof)
Now we are ready to prove the overall optimality.

Theorem 3: Algorithm Huffman-Coding produces an optimal encoding tree.

Proof: This can be proved by induction on the size of the symbol set S. The base case |S| = 1 trivially holds. Suppose Huffman-Coding produces an optimal tree whenever |S| = k. Now consider |S| = k + 1, and let x, y be two symbols having the lowest weights. Create a new symbol z with w(z) = w(x) + w(y) and let S' = (S − {x, y}) ∪ {z}; note that |S'| = k. Let T' be the tree produced by Huffman-Coding for S'. Then T' is optimal by the induction hypothesis. The tree T produced by Huffman-Coding for S is exactly T' with the leaf node z expanded into an internal node with children x, y, so by Lemma 2, T is also optimal. (end of proof)

Time Complexity:
The node set S is maintained as a priority queue so that the two least-weight nodes can be selected efficiently in each iteration. Each insert or extract-min operation costs O(log n) time, and the algorithm performs O(n) such operations, so the total time complexity is O(n log n).

Space Complexity:
The tree could be represented by an integer array with O(n) space. The priority queue takes O(n) space. So the overall space complexity is O(n).

Pseudo Code

Huffman coding
Input: an array s[1..n] with properties n (node number) and w (weight)
Output: an array T[1..2n-1] representing the Huffman tree, where T[i] stores the parent of node i (nodes 1..n are the leaves; node 2n-1 is the root and its entry is left unassigned)

    create a priority queue Q ordered by w
    for i = 1 to n
        insert(Q, s[i])
    for i = n+1 to 2n-1
        x = extractmin(Q)
        y = extractmin(Q)
        create a new node z with node number n(z) = i and weight w(z) = w(x) + w(y)
        T[n(x)] = i
        T[n(y)] = i
        insert(Q, z)
    return T
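The pseudocode above can be translated almost line for line into Python (0-based indices instead of 1-based; the parent-array representation T matches the pseudocode, while the code_length helper is an added illustration):

```python
import heapq

# Sketch translating the pseudocode above (0-based: leaves are nodes
# 0..n-1, internal nodes are n..2n-2, and node 2n-2 is the root).
# T[i] stores the parent of node i; the root's entry is unused.
def huffman_parents(w):
    n = len(w)
    T = [0] * (2 * n - 1)
    Q = [(w[i], i) for i in range(n)]      # (weight, node number) pairs
    heapq.heapify(Q)                       # priority queue ordered by weight
    for i in range(n, 2 * n - 1):
        wx, x = heapq.heappop(Q)           # extractmin twice
        wy, y = heapq.heappop(Q)
        T[x] = i                           # record parent pointers
        T[y] = i
        heapq.heappush(Q, (wx + wy, i))    # insert the merged node z
    return T

def code_length(T, leaf):
    """Depth of a leaf in the tree = length of its codeword."""
    root, d = len(T) - 1, 0
    while leaf != root:
        leaf = T[leaf]
        d += 1
    return d

T = huffman_parents([5, 2, 1, 1])
print([code_length(T, i) for i in range(4)])  # -> [1, 2, 3, 3]
```

The parent array determines codeword lengths (and hence the weighted path length) directly; recovering the actual 0/1 codewords additionally requires recording which child is left and which is right at each merge.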


© The University of Hong Kong Algorithms Library - hkual@cs.hku.hk