4.3 Huffman coding Problem: Huffman tree Huffman tree is a full binary tree. It is not difficult to see that the encoding represented by a full binary tree is prefix-free (moreover, it could also be proved that any prefix-free encoding could be represented by a full binary tree). Algorithm To find the encoding tree with minimum weighted path length, one would expect that low weight symbols will have longer tree path while high weight symbols have shorter tree path. Suppose the encoding tree is built from leaf nodes in a bottom-up manner. Then a greedy strategy will first consider low weight leaves since leaves at tree's deepest bottom level will have longest path length. Huffman coding uses this strategy and the algorithm is presented as follows. Huffman-Coding
Correctness: Lemma 1: Let x and y be two symbols having the lowest weights. Then there exists an optimal encoding tree where x and y are siblings and they are at the tree's deepest level. Proof: First note that an optimal encoding tree must be a full binary tree (otherwise we could ‘contract' the tree so that weighted path length reduces). So there are at least two nodes which are at tree's deepest level. Suppose some symbol z other than x (or y) is at the deepest level of an optimal tree. Then by swapping z and x (or y)'s positions we will have a new tree with smaller weighted path length. This contradicts with the tree's optimality. So x, y must be both at the deepest level of an optimal tree. On the other hand, with same swapping argument we could let x, y be siblings. (end of proof) Lemma 2: Let x and y be two symbols having the lowest weights. We remove x, y from the symbol set and add a new symbol z. Symbol z's weight is set as the sum of x, y's weights. Denote the original symbol set as S and the new one as S '. Note that S ' = (S - {x,y}) {z}. Let T ' be an optimal encoding tree for S '. We construct tree T from T ' by replacing the leaf node for z with a new internal node having children x, y. Then T is an optimal encoding tree for original symbol set S. Proof: Let L' be the weighted path length of T ' and L be that of T. It is not difficult to see that L = L' + w(x) + w(y). To form a contradiction, suppose that the optimal encoding tree for S have weighted path length . From Lemma 1 we have that x, y must be at the deepest level of tree. So we can construct a new tree by replacing the parent node of x, y (together with x, y) with a new leaf node for z, then must be an encoding tree for S'. Observe that has weighted path length , which is less than L'. This contradicts with the precondition that T ' is the optimal encoding tree. Thus L is the smallest possible weighted path length for encoding S, which means T is the optimal encoding tree for S. (end of proof) Theorem 3: Algorithm Huffman-Coding produces an optimal encoding tree. Proof: This could be proved by induction on the size of the symbol set S. The base case for | S | = 1 trivially holds. Suppose Huffman-Coding produces an optimal tree for | S | = k. Then considering | S | = k + 1, let x, y be two symbols having the lowest weights. Create new symbol z with w(z) = w(x) + w(y) and let S' = (S - {x,y}) {z}. Let the tree produced by Huffman-Coding for S' be T '. Then T ' is optimal according to induction hypothesis. Note that the tree T produced by Huffman-Coding for S is just expanding leaf node z with children x, y. And by Lemma 2 we know that T is also optimal. Time Complexity: Space Complexity: Pseudo Code
Other references: [1] http://en.wikipedia.org/wiki/Huffman_coding |
© The University of Hong Kong Algorithms Library - hkual@cs.hku.hk |