7/16/02
Amir Kamil

Topic: BSTs, Heaps, Algorithm Analysis, Huffman Encoding

Binary Trees and BSTs:

- We'll look at BSTs that hold integers as values, and use their values as
  keys. We will also ignore duplicate keys.

- Every key to the left of a node must be less than that node's key, and
  every key to the right of a node must be greater than that node's key.

- Insertion into a BST: The algorithm is very simple. Start at the root. If
  the key to insert is less than the current node's key, recurse on the left
  subtree; otherwise recurse on the right subtree. When you reach null, add
  the new node there.

  Ex: Insert 6 into the following tree:

        4
       / \
      3   8
     /   / \
    1   5   11

  Run the algorithm, with the current node denoted by *.

        4*
       / \
      3   8
     /   / \
    1   5   11

  Since 6 > 4, recurse on the right subtree.

        4
       / \
      3   8*
     /   / \
    1   5   11

  Since 6 < 8, recurse on the left subtree.

        4
       / \
      3   8
     /   / \
    1   5*  11

  Since 6 > 5, recurse on the right subtree.

        4
       / \
      3   8
     /   / \
    1   5   11
         \
          *

  Now we have reached null, so we add 6 at the current spot.

        4
       / \
      3   8
     /   / \
    1   5   11
         \
          6

- Searching in a BST is the same as insertion, except you stop when you find
  the node you're looking for (it has a key equal to the one you want). If
  you reach null, then the key is not in the BST.

- Removal from a BST: There are three cases: the node to be removed has no
  children, it has one child, or it has two children.

  Case 1: the node has no children. Just replace the node with null.

  Ex: Removing 11 from the above tree results in:

        4
       / \
      3   8
     /   /
    1   5
         \
          6

  Case 2: the node has one child. Replace the node with its child.

  Ex: Removing 8 from the above tree results in:

        4
       / \
      3   5
     /     \
    1       6

  Case 3: the node has two children. Replace the node with the leftmost
  descendant of its right child.

  Ex: Remove 4 from:

        4
       / \
      3   8
     /   / \
    1   5   11
         \
          6

  The leftmost descendant of 4's right child is 5.
  So we replace 4 with 5, and move 6 up to where 5 was:

        5
       / \
      3   8
     /   / \
    1   6   11

  Now if we remove 8, we get:

        5
       / \
      3   11
     /   /
    1   6

- Balance: A binary tree is balanced if the difference in height between its
  subtrees is at most 1, and its subtrees are themselves balanced.

        4
       / \
      3   5
     /     \
    1       6

    balanced

          4
         / \
        3   5
       /     \
      2       6
     /         \
    1           7

    not balanced

          6
         / \
        3   7
       / \   \
      2   5   8
     /
    1

    balanced

  Algorithm:

    isBalanced(node) =
      |height(node.left) - height(node.right)| <= 1
        and isBalanced(node.left)
        and isBalanced(node.right)

- Completeness: A binary tree is complete if all levels except the last are
  full, and the nodes in the last level are all the way to the left.

        4
       / \
      3   5
     /     \
    1       6

    not complete

        5
       / \
      2   6
     / \  /
    1   3 7

    complete

          6
         / \
        3   7
       / \   \
      2   5   8
     /
    1

    not complete

  Algorithms for computing completeness involve a breadth-first search on
  the tree. The one I wrote for HW3 used the fact that a complete binary
  tree can be represented compactly using an array, with no empty cells
  between elements:

    Convert the tree into an array.
    If there are gaps between elements, the tree is not complete.
    Else the tree is complete.

Heaps:

- Heaps are of type min-heap or max-heap. In a min-heap, each node in the
  tree has priority less than each of its children. In a max-heap, each
  node in the tree has priority greater than each of its children.

- We will use a max-heap of ints.

- A heap is implemented using a complete binary tree, so we can represent
  it using an array. For a node located at position n, its children are at
  positions 2n+1 and 2n+2.

  Ex:

        7
       / \
      3   6
     / \  /
    1   2 5

  The root is at position 0.

     ---  ---  ---  ---  ---  ---  ---  ---
    | 7 || 3 || 6 || 1 || 2 || 5 ||   ||   |
     ---  ---  ---  ---  ---  ---  ---  ---

- Insertion into a heap: We insert at the bottom of the heap, then reheapify
  up. We reheapify up by comparing the new item with its parent; if the new
  item is larger, we swap it with its parent and recurse, otherwise we stop.

  Ex: Add 8 to the above heap. We stick it at the bottom.

        7
       / \
      3   6
     / \  / \
    1   2 5  8

  Since 8 > 6, we swap them.
        7
       / \
      3   8
     / \  / \
    1   2 5  6

  Since 8 > 7, we swap them.

        8
       / \
      3   7
     / \  / \
    1   2 5  6

  Now we're done. The array representation:

     ---  ---  ---  ---  ---  ---  ---  ---
    | 8 || 3 || 7 || 1 || 2 || 5 || 6 ||   |
     ---  ---  ---  ---  ---  ---  ---  ---

  Now let's add 4 to the heap. We'll work directly on the array this time.

     ---  ---  ---  ---  ---  ---  ---  ---
    | 8 || 3 || 7 || 1 || 2 || 5 || 6 || 4 |
     ---  ---  ---  ---  ---  ---  ---  ---

  The parent of a node at position k is given by floor((k-1)/2). Since 4 is
  at position 7, its parent is at floor((7-1)/2) = 3. Since 4 > 1, we swap
  them.

     ---  ---  ---  ---  ---  ---  ---  ---
    | 8 || 3 || 7 || 4 || 2 || 5 || 6 || 1 |
     ---  ---  ---  ---  ---  ---  ---  ---

  Now 4 is at position 3, so its parent is at floor((3-1)/2) = 1. Since
  4 > 3, we swap them.

     ---  ---  ---  ---  ---  ---  ---  ---
    | 8 || 4 || 7 || 3 || 2 || 5 || 6 || 1 |
     ---  ---  ---  ---  ---  ---  ---  ---

  Now 4 is at position 1, so its parent is at floor((1-1)/2) = 0. Since
  4 < 8, we're done.

- Removing from the heap: We only need to bother with removing the top of
  the heap, since that's why we're using a heap in the first place. So we
  remove the top item in the heap, replacing it with the bottom-most item
  in the heap (the one all the way at the end in the array representation).
  Then we reheapify down: we compare the new top item with its children,
  swap it with its largest child if that child is larger than the item, and
  recurse. Otherwise we stop.

  Ex: Remove 8 from the above heap. We remove 8 and replace it by 1.

     ---  ---  ---  ---  ---  ---  ---  ---
    | 1 || 4 || 7 || 3 || 2 || 5 || 6 ||   |
     ---  ---  ---  ---  ---  ---  ---  ---

  Now we compare 1 to its children, 4 and 7. 7 is the largest, so we swap
  1 and 7.

     ---  ---  ---  ---  ---  ---  ---  ---
    | 7 || 4 || 1 || 3 || 2 || 5 || 6 ||   |
     ---  ---  ---  ---  ---  ---  ---  ---

  Again we compare 1 to its children, 5 and 6. 6 is the largest, so we swap
  1 and 6.

     ---  ---  ---  ---  ---  ---  ---  ---
    | 7 || 4 || 6 || 3 || 2 || 5 || 1 ||   |
     ---  ---  ---  ---  ---  ---  ---  ---

  We're done.
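The insertion and removal steps above can be sketched as an array-backed max-heap. This is a rough sketch, not code from the course; the class and method names (MaxHeap, insert, removeMax) are my own.

```java
// A max-heap of ints backed by an array, using the layout from the notes:
// the children of position k are at 2k+1 and 2k+2, and the parent of
// position k is at floor((k-1)/2).
public class MaxHeap {
    private int[] items = new int[8];
    private int size = 0;

    // Insert at the bottom of the heap, then reheapify up.
    public void insert(int x) {
        if (size == items.length) {              // grow the array when full
            int[] bigger = new int[2 * items.length];
            System.arraycopy(items, 0, bigger, 0, size);
            items = bigger;
        }
        items[size] = x;
        int k = size++;
        while (k > 0 && items[k] > items[(k - 1) / 2]) {
            swap(k, (k - 1) / 2);                // larger than parent: swap up
            k = (k - 1) / 2;
        }
    }

    // Remove the top item (assumes the heap is non-empty): replace it with
    // the bottom-most item, then reheapify down.
    public int removeMax() {
        int top = items[0];
        items[0] = items[--size];
        int k = 0;
        while (2 * k + 1 < size) {               // while k has a child
            int child = 2 * k + 1;               // left child
            if (child + 1 < size && items[child + 1] > items[child]) {
                child++;                         // right child is larger
            }
            if (items[k] >= items[child]) {
                break;                           // heap property restored
            }
            swap(k, child);
            k = child;
        }
        return top;
    }

    private void swap(int i, int j) {
        int t = items[i]; items[i] = items[j]; items[j] = t;
    }
}
```

Inserting 7, 3, 6, 1, 2, 5, then 8 and 4, reproduces the arrays worked out above.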
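Looking back at the BST section, the insertion and search algorithms can be sketched the same way. Again a sketch under the notes' assumptions (int keys, duplicates ignored); the names (BST, insert, contains) are mine, not the course's.

```java
// A BST of ints that uses the values as keys and ignores duplicates.
public class BST {
    private static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private Node root;

    // Walk down from the root, going left on smaller keys and right on
    // larger ones, and attach the new node where we reach null.
    public void insert(int key) {
        root = insert(root, key);
    }

    private Node insert(Node node, int key) {
        if (node == null) return new Node(key);
        if (key < node.key) node.left = insert(node.left, key);
        else if (key > node.key) node.right = insert(node.right, key);
        return node;                     // equal key: duplicate, ignore it
    }

    // Search is the same walk, stopping when we find the key or hit null.
    public boolean contains(int key) {
        Node node = root;
        while (node != null) {
            if (key == node.key) return true;
            node = (key < node.key) ? node.left : node.right;
        }
        return false;
    }
}
```

Inserting 4, 3, 8, 1, 5, 11 and then 6 builds exactly the tree from the insertion walkthrough.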
Algorithm Analysis:

- General algorithm for algorithm analysis: Let A be the expression you want
  to analyze. Then A is composed of subexpressions A1, A2, A3, ... that
  execute c1, c2, c3, ... times each. Then

    O[A] = c1 * O[A1] + c2 * O[A2] + c3 * O[A3] + ...

  where O[X] represents the running time of X.

  Ex:

    int increment(int n) {
      return n + 1;
    }

  This function has one subexpression, which takes constant time to execute
  and executes only once. So c1 = 1, O[A1] = c, and O[increment] = 1 * c = c,
  so increment is constant in n.

  Ex:

    int factorial(int n) {
      int result = 1;
      for (int i = n; i > 0; i--) {
        result *= i;
      }
      return result;
    }

  This function has a subexpression (result *= i) that takes constant time
  to execute, and this subexpression is executed n times. So c1 = n,
  O[A1] = c, and O[factorial] = n * c = cn. So factorial is linear in n.

  Ex:

    int foo(int n) {
      int x = 0;
      for (int i = 0; i < n; i++) {
        x += i;
      }
      for (int i = 1; i < n/2; i++) {
        x *= i;
      }
      return x;
    }

  In this case, c1 = n, O[A1] = c, c2 = n/2, O[A2] = c. Then
  O[foo] = n * c + n/2 * c = 1.5cn. So foo is linear in n. Remember that
  constant factors don't count in determining asymptotic behavior.

  Ex:

    int bar(int n) {
      int x = 0;
      for (int i = 0; i < n; i++) {
        for (int j = i; j < n; j++) {
          x += 1;
        }
      }
      return x;
    }

  This function has one subexpression, the inner loop, that executes n
  times. What is the running time of the subexpression? The inner loop has
  one subexpression of its own that is constant and executes n - i times,
  so the running time of the inner loop is c * (n - i). Now the problem
  with determining the running time of the function is that i varies. But
  we can make estimates, as long as the estimates are greater than the
  actual value, so let's assume that the running time of the inner loop is
  cn. The inner loop executes n times, so the total running time is at most
  n * cn = cn^2. So the function is quadratic in n.
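We can sanity-check the estimate for bar by counting how many times its inner statement actually runs: the exact count is n + (n-1) + ... + 1 = n(n+1)/2, which is below our estimate of n^2 but still quadratic in n. (countBar is my own instrumented copy of bar, not from the notes.)

```java
// Counts executions of bar's inner statement for a given n.
public class BarCount {
    static int countBar(int n) {
        int count = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                count += 1;              // stands in for x += 1
            }
        }
        return count;                    // equals n * (n + 1) / 2
    }
}
```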
- Running time of recursive algorithms: It is much harder to use the above
  algorithm on recursive functions, since you don't know the running time of
  the recursive call. So instead, start at the base case and work your way
  up until you can figure out the running time.

  Ex:

    int factorial2(int n) {
      if (n == 0) {
        return 1;
      } else {
        return n * factorial2(n - 1);
      }
    }

  The base case, n == 0, is constant. Each recursion of this function is
  also constant except for the recursive call, since there is only a
  multiplication and a subtraction. So if you recurse n times, you take
  n * c = cn time to compute the function.

  General method:
    a) Figure out how much work is done in each recursion, excepting the
       recursive call.
    b) Figure out how many recursions are needed.
    c) Running time = # recursions * time/recursion

Huffman Trees:

- Good explanation at http://www.crs4.it/~luigi/MPEG/huffman_tutorial.html
  The link is on my website.
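To check step (b) of the recursive-analysis method above, factorial2 can be instrumented with a call counter: for input n there are n recursive calls plus one base-case call, and each does constant work, giving linear time. (The counter is added here for illustration and is not part of the notes.)

```java
// factorial2 from the notes, with a counter tracking the number of calls.
public class Factorial2Count {
    static int calls = 0;

    static int factorial2(int n) {
        calls++;                         // one unit of constant work per call
        if (n == 0) {
            return 1;
        } else {
            return n * factorial2(n - 1);
        }
    }
}
```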