7/25/02                                                            Amir Kamil

Announcement:

Topic: Hashtables, Sorting

Hashtables:

- We want a lookup table that will give us O(1) insertion and retrieval time.

- A normal array will work if we have objects distinguished by a single
  number, but what about objects that are identified by n-tuples?

- We can use an n-dimensional array, but the space requirements are huge, and
  it doesn't work when n doesn't have an upper limit (e.g. strings).

- Instead, we use a "hash code" that assigns an int value to an object, as
  uniquely as possible.

- Furthermore, we use arrays of length O(n), where n is the number of objects
  we want to store in the hashtable.  We place an object with hash code c at
  index c % l, where l is the length of the array.

- The locations in a hashtable are called "buckets".

- What if multiple objects have the same value for c % l?  Then we have a
  "collision".  The simplest way to deal with collisions is to store linked
  lists at each bucket, rather than the objects themselves.  Then we chain
  the object to the end of the linked list corresponding to its bucket.

- Ex:
       ___    ___    ___    ___    ___    ___
      |   |  |   |  |   |  |   |  |   |  |   |     <-- hashtable
      |_|_|  |_|_|  |_|_|  |_|_|  |_|_|  |_|_|
        |      |      |      |      |      |
        V      V      V      V      V      V
      c=18   c=19   c=15   c=10   c=5    c=24      <-- linked lists
      c=37                 c=4
      c=88
                                               (c = hash code)

- We want to minimize the number of collisions in our hashtable, in order to
  keep the running time at O(1).  For example, if we had n objects, all of
  which were in the same bucket, we'd have to go through the entire linked
  list for that bucket in order to retrieve an item, which would give us a
  running time of O(n).

- In order to minimize collisions, we need a good hash function that will
  evenly distribute objects in the hashtable.

- Example hash function for strings:
      hash code = s.charAt(0) + s.charAt(1) + s.charAt(2) + ...
  What's the problem with this hash function?  Not only will all anagrams be
  given the same hash code, but even unrelated strings such as "at" and "il"
  will be given the same hash code.

- The goal of a hash function is to give equal objects the same hash code and
  unequal objects hash codes that are as distinct as possible.

- Many objects can be identified by n-tuples of numbers.  A good hash
  function on an n-tuple (x[0], x[1], ..., x[n-1]) is:
      x[0] * a^(n-1) + x[1] * a^(n-2) + ... + x[n-2] * a + x[n-1]
  where a is a constant.  This may overflow an int, since we are limited to
  2^31 - 1, but we can ignore such overflows.  Note that this will also work
  on objects that don't have an upper bound on n, such as strings.  Is there
  a requirement on what a we choose?  We don't want to choose an a equal to
  the array length, for example, since then the bucket index would depend
  only on x[n-1].  Choosing a prime number minimizes the chance of this
  happening.  G&T suggests using 33, 37, 39, or 41 for strings (which aren't
  all primes).  See the sketch below.
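- In Java, the polynomial hash code above might be written roughly as follows
  (just a sketch, not code from lecture or lab; the method name stringHash
  and the choice a = 37 are arbitrary):

      // Sketch of a polynomial hash code for strings, with a = 37.
      // Overflow is simply ignored, as discussed above.
      public static int stringHash(String s) {
          int hash = 0;
          for (int i = 0; i < s.length(); i++) {
              hash = 37 * hash + s.charAt(i);   // hash = hash * a + next char
          }
          return hash;
      }

  To pick a bucket in a table of length l, we would then compress the hash
  code with something like Math.abs(hash) % l, since the overflowed value may
  be negative.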
Sorting:

- In all these sorting algorithms, I'm assuming that there are no duplicate
  elements.

- Selection Sort:

  Algorithm: Given an array x,
    For i = 0 to i = x.length - 1 do:
      Find the smallest element in the subarray x[i] to x[x.length-1].
      Swap the smallest element with the element at i.

  Ex: Sort (14, 6, 34, 3, 64, 1, 11, 5, 97).   (i = #, smallest = *)

    (14#, 6, 34, 3, 64, 1*, 11, 5, 97)
    (1, 6#, 34, 3*, 64, 14, 11, 5, 97)
    (1, 3, 34#, 6, 64, 14, 11, 5*, 97)
    (1, 3, 5, 6#*, 64, 14, 11, 34, 97)
    (1, 3, 5, 6, 64#, 14, 11*, 34, 97)
    (1, 3, 5, 6, 11, 14#*, 64, 34, 97)
    (1, 3, 5, 6, 11, 14, 64#, 34*, 97)
    (1, 3, 5, 6, 11, 14, 34, 64#*, 97)
    (1, 3, 5, 6, 11, 14, 34, 64, 97#*)
    (1, 3, 5, 6, 11, 14, 34, 64, 97)

  Analysis: Each iteration of the loop takes n/2 time, on average, to search
  through the rest of the array.  n iterations are required, so this sort is
  O(n^2).

- Insertion Sort:

  Algorithm: Given a linked list x,
    Create a new linked list y
    For i = 0 to i = x.length() - 1 do:
      Find the first element in y that is greater than the head of x
      Insert the head of x in front of that element in y
      Remove the head of x

  Ex: Sort (14, 6, 34, 3, 64, 1, 11, 5, 97).

    x: (14, 6, 34, 3, 64, 1, 11, 5, 97)   y: ()
    x: (6, 34, 3, 64, 1, 11, 5, 97)       y: (14)
    x: (34, 3, 64, 1, 11, 5, 97)          y: (6, 14)
    x: (3, 64, 1, 11, 5, 97)              y: (6, 14, 34)
    x: (64, 1, 11, 5, 97)                 y: (3, 6, 14, 34)
    x: (1, 11, 5, 97)                     y: (3, 6, 14, 34, 64)
    x: (11, 5, 97)                        y: (1, 3, 6, 14, 34, 64)
    x: (5, 97)                            y: (1, 3, 6, 11, 14, 34, 64)
    x: (97)                               y: (1, 3, 5, 6, 11, 14, 34, 64)
    x: ()                                 y: (1, 3, 5, 6, 11, 14, 34, 64, 97)

  Analysis: Each iteration of the loop takes n/2 time, on average, to search
  through y for the first element greater than the head of x.  n iterations
  are required, so this sort is O(n^2).

- Heap Sort:

  Algorithm: Given an array x,
    Create a new array y.
    For i = x.length - 1 to i = 0 do:
      reheapifyDown(x, i)
    For i = 0 to i = x.length - 1 do:
      y[i] = removeMin(x)

  Analysis: If you recall from lab 8, heapifying the array using
  reheapifyDown() takes n time.  Each removeMin() operation takes log n time,
  and n removeMin() operations are required, so the running time is
  max(n, n log n) = n log n.  Note that if we had heapified using
  reheapifyUp(), which takes n log n time, the asymptotic running time would
  have been the same.

- Merge Sort:

  Algorithm: Given an array x,
    If x.length == 1
      Return x
    Else:
      Split the array into two pieces y and z, with the first
        ceiling(x.length/2) elements in y and the remaining elements in z.
      Merge sort y and z.
      Return Merge(y, z).

  Merge: Given arrays y, z,
    Create a new array u
    Let i = j = 0
    While i < y.length and j < z.length do:
      If y[i] < z[j]:
        u[i+j] = y[i]
        i = i + 1
      Else:
        u[i+j] = z[j]
        j = j + 1
    If i < y.length
      Append remaining elements of y to u
    Else
      Append remaining elements of z to u
    Return u

  Ex: Sort (14, 6, 34, 3, 64, 1, 11, 5, 97).

    (14, 6, 34, 3, 64, 1, 11, 5, 97)
    (14, 6, 34, 3, 64) (1, 11, 5, 97)
    (14, 6, 34) (3, 64) (1, 11) (5, 97)
    (14, 6) (34) (3) (64) (1) (11) (5) (97)
    (14) (6) (34) (3) (64) (1) (11) (5) (97)
    (6, 14) (34) (3) (64) (1) (11) (5) (97)
    (6, 14, 34) (3, 64) (1, 11) (5, 97)
    (3, 6, 14, 34, 64) (1, 5, 11, 97)
    (1, 3, 5, 6, 11, 14, 34, 64, 97)

  Analysis: As you can see from the above picture, there are O(log n) levels,
  each with n elements.  Merge is linear, so each level takes n time to
  merge.  So the running time of this sort is O(n log n).  See the sketch
  below.
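- In Java, merge sort on an int array might look roughly like this (just a
  sketch, not code from lecture or lab; the names mergeSort and merge are
  arbitrary):

      // Sketch of merge sort.  Returns a new sorted array.
      public static int[] mergeSort(int[] x) {
          if (x.length <= 1) {
              return x;
          }
          // First ceiling(x.length/2) elements go into y, the rest into z.
          int mid = (x.length + 1) / 2;
          int[] y = new int[mid];
          int[] z = new int[x.length - mid];
          System.arraycopy(x, 0, y, 0, y.length);
          System.arraycopy(x, mid, z, 0, z.length);
          return merge(mergeSort(y), mergeSort(z));
      }

      // Merge two sorted arrays into one sorted array.
      public static int[] merge(int[] y, int[] z) {
          int[] u = new int[y.length + z.length];
          int i = 0, j = 0;
          while (i < y.length && j < z.length) {
              if (y[i] < z[j]) {
                  u[i + j] = y[i];
                  i = i + 1;
              } else {
                  u[i + j] = z[j];
                  j = j + 1;
              }
          }
          // Append whatever remains of y or z (only one copies anything).
          System.arraycopy(y, i, u, i + j, y.length - i);
          System.arraycopy(z, j, u, i + j, z.length - j);
          return u;
      }

  For example, mergeSort(new int[] {14, 6, 34, 3, 64, 1, 11, 5, 97}) returns
  the array sorted in increasing order.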
- Quick Sort:

  Algorithm: Given a linked list x,
    If x.length() <= 1
      Return x
    Else:
      Choose a "pivot" element in x
      Create new linked lists y and z
      Place all elements less than the pivot in y
      Place all elements greater than the pivot in z
      Quick sort y and z.
      Return y + pivot + z.

  Ex: Sort (14, 34, 3, 64, 1, 11, 5, 97), using the last element in the list
  as the pivot.

    (14, 34, 3, 64, 1, 11, 5, 97)
    (14, 34, 3, 64, 1, 11, 5) 97 ()
    (3, 1) 5 (14, 34, 64, 11) 97 ()
    () 1 (3) 5 () 11 (14, 34, 64) 97 ()
    () 1 (3) 5 () 11 (14, 34) 64 () 97 ()
    () 1 (3) 5 () 11 ((14) 34 ()) 64 () 97 ()
    () 1 (3) 5 () 11 (14, 34) 64 () 97 ()
    () 1 (3) 5 () 11 (14, 34, 64) 97 ()
    (1, 3) 5 (11, 14, 34, 64) 97 ()
    (1, 3, 5, 11, 14, 34, 64) 97 ()
    (1, 3, 5, 11, 14, 34, 64, 97)

  Analysis: In each recursive step, we expect to partition the items into two
  equally sized sets.  In this case, there would be log n levels, and since
  the partitioning at each level takes n time in total, the expected running
  time is O(n log n).  However, we chose the last item as the pivot.  In the
  case of an already sorted list, in each step one of our partitions would be
  empty, so we would require n levels to do the sort.  So the worst case
  running time is O(n^2).  Choosing a random pivot makes it more likely that
  sorting an arbitrary list will be closer to the expected time rather than
  the worst case time.

- Modifications: If we want to deal with duplicate elements, we can modify
  the above sorts.  In the case of quick sort, we can keep an "equal"
  partition rather than just a single pivot.  With a change as trivial as
  converting a < or a > to a <= or a >=, we can make the other sorts work on
  duplicate elements as well.  See the quick sort sketch below.
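- In Java, quick sort on a linked list might look roughly like this (just a
  sketch, not code from lecture or lab; it uses java.util.LinkedList, takes
  the last element as the pivot, and keeps an "equal" partition as described
  in the modification above, so duplicates are handled):

      // Sketch of quick sort on a LinkedList<Integer>.
      public static LinkedList<Integer> quickSort(LinkedList<Integer> x) {
          if (x.size() <= 1) {
              return x;
          }
          int pivot = x.getLast();                         // last element as pivot
          LinkedList<Integer> y = new LinkedList<Integer>();      // < pivot
          LinkedList<Integer> equal = new LinkedList<Integer>();  // == pivot
          LinkedList<Integer> z = new LinkedList<Integer>();      // > pivot
          for (int item : x) {
              if (item < pivot) {
                  y.add(item);
              } else if (item > pivot) {
                  z.add(item);
              } else {
                  equal.add(item);
              }
          }
          // Result is (sorted y) + equal partition + (sorted z).
          LinkedList<Integer> result = quickSort(y);
          result.addAll(equal);
          result.addAll(quickSort(z));
          return result;
      }

  For example, quick sorting the list (14, 34, 3, 64, 1, 11, 5, 97) this way
  returns (1, 3, 5, 11, 14, 34, 64, 97).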