7/25/02                                                            Amir Kamil

Announcement:

Topic: Hashtables, Sorting

Hashtables:

- We want a lookup table that will give us O(1) insertion and retrieval time.

- A normal array will work if we have objects distinguished by a single
  number, but what about objects that are identified by n-tuples?

- We can use an n-dimensional array, but the space requirements are huge, and
  it doesn't work when n doesn't have an upper limit (e.g. strings).

- Instead, we use a "hash code" that assigns an int value to an object, as
  uniquely as possible.

- Furthermore, we use arrays of length O(n), where n is the number of objects
  we want to store in the hashtable.  We place an object with hash code c at
  index c % l, where l is the length of the array.

- The locations in a hashtable are called "buckets".

- What if multiple objects have the same value for c % l?  Then we have a
  "collision".  The simplest way to deal with collisions is to store linked
  lists at each bucket, rather than the objects themselves.  Then we chain
  the object to the end of the linked list corresponding to its bucket.

- Ex:
       ___    ___    ___    ___    ___    ___
      |   |  |   |  |   |  |   |  |   |  |   |     <-- hashtable
      |_|_|  |_|_|  |_|_|  |_|_|  |_|_|  |_|_|
        |      |      |      |      |      |
        V      V      V      V      V      V
      c=18   c=19   c=15   c=10   c=5    c=24      <-- linked lists
      c=37                 c=4
      c=88
                                               (c = hash code)

- We want to minimize the number of collisions in our hashtable, in order to
  keep the running time at O(1).  For example, if we had n objects, all of
  which were in the same bucket, we'd have to go through the entire linked
  list for that bucket in order to retrieve an item, which would give us a
  running time of O(n).

- In order to minimize collisions, we need a good hash function that will
  evenly distribute objects in the hashtable.

- Example hash function for strings:
      hash code = s.charAt(0) + s.charAt(1) + s.charAt(2) + ...
  What's the problem with this hash function?  Not only will all anagrams be
  given the same hash code, but even unrelated strings such as "at" and "il"
  will be given the same hash code.

- The goal of a hash function is to give equal objects the same hash code and
  unequal objects hash codes that are as distinct as possible.

- Many objects can be identified by n-tuples of numbers.  A good hash
  function on an n-tuple (x[0], x[1], ..., x[n-1]) is:
      x[0] * a^(n-1) + x[1] * a^(n-2) + ... + x[n-2] * a + x[n-1]
  where a is a constant.  This may overflow an int, since we are limited to
  2^31 - 1, but we can ignore such overflows.  Note that this will also work
  on objects that don't have an upper bound on n, such as strings.  Is there
  a requirement on what a we choose?  We don't want to choose an a equal to
  the array length, for example, since then the bucket index would depend
  only on x[n-1].  Choosing a prime number minimizes the chance of this
  happening.  G&T suggests using 33, 37, 39, or 41 for strings (which aren't
  all primes).  See the sketch below.
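- In Java, the polynomial hash code above might be written roughly as follows
  (just a sketch, not code from lecture or lab; the method name stringHash
  and the choice a = 37 are arbitrary):

      // Sketch of a polynomial hash code for strings, with a = 37.
      // Overflow is simply ignored, as discussed above.
      public static int stringHash(String s) {
          int hash = 0;
          for (int i = 0; i < s.length(); i++) {
              hash = 37 * hash + s.charAt(i);   // hash = hash * a + next char
          }
          return hash;
      }

  To pick a bucket in a table of length l, we would then compress the hash
  code with something like Math.abs(hash) % l, since the overflowed value may
  be negative.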
Sorting:

- In all these sorting algorithms, I'm assuming that there are no duplicate
  elements.

- Selection Sort:

  Algorithm: Given an array x,
    For i = 0 to i = x.length - 1 do:
      Find the smallest element in the subarray x[i] to x[x.length-1].
      Swap the smallest element with the element at i.

  Ex: Sort (14, 6, 34, 3, 64, 1, 11, 5, 97).   (i = #, smallest = *)

    (14#, 6, 34, 3, 64, 1*, 11, 5, 97)
    (1, 6#, 34, 3*, 64, 14, 11, 5, 97)
    (1, 3, 34#, 6, 64, 14, 11, 5*, 97)
    (1, 3, 5, 6#*, 64, 14, 11, 34, 97)
    (1, 3, 5, 6, 64#, 14, 11*, 34, 97)
    (1, 3, 5, 6, 11, 14#*, 64, 34, 97)
    (1, 3, 5, 6, 11, 14, 64#, 34*, 97)
    (1, 3, 5, 6, 11, 14, 34, 64#*, 97)
    (1, 3, 5, 6, 11, 14, 34, 64, 97#*)
    (1, 3, 5, 6, 11, 14, 34, 64, 97)

  Analysis: Each iteration of the loop takes n/2 time, on average, to search
  through the rest of the array.  n iterations are required, so this sort is
  O(n^2).

- Insertion Sort:

  Algorithm: Given a linked list x,
    Create a new linked list y
    For i = 0 to i = x.length() - 1 do:
      Find the first element in y that is greater than the head of x
      Insert the head of x in front of that element in y
      Remove the head of x

  Ex: Sort (14, 6, 34, 3, 64, 1, 11, 5, 97).

    x: (14, 6, 34, 3, 64, 1, 11, 5, 97)   y: ()
    x: (6, 34, 3, 64, 1, 11, 5, 97)       y: (14)
    x: (34, 3, 64, 1, 11, 5, 97)          y: (6, 14)
    x: (3, 64, 1, 11, 5, 97)              y: (6, 14, 34)
    x: (64, 1, 11, 5, 97)                 y: (3, 6, 14, 34)
    x: (1, 11, 5, 97)                     y: (3, 6, 14, 34, 64)
    x: (11, 5, 97)                        y: (1, 3, 6, 14, 34, 64)
    x: (5, 97)                            y: (1, 3, 6, 11, 14, 34, 64)
    x: (97)                               y: (1, 3, 5, 6, 11, 14, 34, 64)
    x: ()                                 y: (1, 3, 5, 6, 11, 14, 34, 64, 97)

  Analysis: Each iteration of the loop takes n/2 time, on average, to search
  through y for the first element greater than the head of x.  n iterations
  are required, so this sort is O(n^2).

- Heap Sort:

  Algorithm: Given an array x,
    Create a new array y.
    For i = x.length - 1 to i = 0 do:
      reheapifyDown(x, i)
    For i = 0 to i = x.length - 1 do:
      y[i] = removeMin(x)

  Analysis: If you recall from lab 8, heapifying the array using
  reheapifyDown() takes n time.  Each removeMin() operation takes log n time,
  and n removeMin() operations are required, so the running time is
  max(n, n log n) = n log n.  Note that if we had heapified using
  reheapifyUp(), which takes n log n time, the asymptotic running time would
  have been the same.

- Merge Sort:

  Algorithm: Given an array x,
    If x.length == 1
      Return x
    Else:
      Split the array into two pieces y and z, with the first
        ceiling(x.length/2) elements in y and the remaining elements in z.
      Merge sort y and z.
      Return Merge(y, z).

  Merge: Given arrays y, z,
    Create a new array u
    Let i = j = 0
    While i < y.length and j < z.length do:
      If y[i] < z[j]:
        u[i+j] = y[i]
        i = i + 1
      Else:
        u[i+j] = z[j]
        j = j + 1
    If i < y.length
      Append remaining elements of y to u
    Else
      Append remaining elements of z to u
    Return u

  Ex: Sort (14, 6, 34, 3, 64, 1, 11, 5, 97).

    (14, 6, 34, 3, 64, 1, 11, 5, 97)
    (14, 6, 34, 3, 64) (1, 11, 5, 97)
    (14, 6, 34) (3, 64) (1, 11) (5, 97)
    (14, 6) (34) (3) (64) (1) (11) (5) (97)
    (14) (6) (34) (3) (64) (1) (11) (5) (97)
    (6, 14) (34) (3) (64) (1) (11) (5) (97)
    (6, 14, 34) (3, 64) (1, 11) (5, 97)
    (3, 6, 14, 34, 64) (1, 5, 11, 97)
    (1, 3, 5, 6, 11, 14, 34, 64, 97)

  Analysis: As you can see from the above picture, there are O(log n) levels,
  each with n elements.  Merge is linear, so each level takes n time to
  merge.  So the running time of this sort is O(n log n).  See the sketch
  below.
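- In Java, merge sort on an int array might look roughly like this (just a
  sketch, not code from lecture or lab; the names mergeSort and merge are
  arbitrary):

      // Sketch of merge sort.  Returns a new sorted array.
      public static int[] mergeSort(int[] x) {
          if (x.length <= 1) {
              return x;
          }
          // First ceiling(x.length/2) elements go into y, the rest into z.
          int mid = (x.length + 1) / 2;
          int[] y = new int[mid];
          int[] z = new int[x.length - mid];
          System.arraycopy(x, 0, y, 0, y.length);
          System.arraycopy(x, mid, z, 0, z.length);
          return merge(mergeSort(y), mergeSort(z));
      }

      // Merge two sorted arrays into one sorted array.
      public static int[] merge(int[] y, int[] z) {
          int[] u = new int[y.length + z.length];
          int i = 0, j = 0;
          while (i < y.length && j < z.length) {
              if (y[i] < z[j]) {
                  u[i + j] = y[i];
                  i = i + 1;
              } else {
                  u[i + j] = z[j];
                  j = j + 1;
              }
          }
          // Append whatever remains of y or z (only one copies anything).
          System.arraycopy(y, i, u, i + j, y.length - i);
          System.arraycopy(z, j, u, i + j, z.length - j);
          return u;
      }

  For example, mergeSort(new int[] {14, 6, 34, 3, 64, 1, 11, 5, 97}) returns
  the array sorted in increasing order.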
- Quick Sort:

  Algorithm: Given a linked list x,
    If x.length() <= 1
      Return x
    Else:
      Choose a "pivot" element in x
      Create new linked lists y and z
      Place all elements less than the pivot in y
      Place all elements greater than the pivot in z
      Quick sort y and z.
      Return y + pivot + z.

  Ex: Sort (14, 34, 3, 64, 1, 11, 5, 97), using the last element in the list
  as the pivot.

    (14, 34, 3, 64, 1, 11, 5, 97)
    (14, 34, 3, 64, 1, 11, 5) 97 ()
    (3, 1) 5 (14, 34, 64, 11) 97 ()
    () 1 (3) 5 () 11 (14, 34, 64) 97 ()
    () 1 (3) 5 () 11 (14, 34) 64 () 97 ()
    () 1 (3) 5 () 11 ((14) 34 ()) 64 () 97 ()
    () 1 (3) 5 () 11 (14, 34) 64 () 97 ()
    () 1 (3) 5 () 11 (14, 34, 64) 97 ()
    (1, 3) 5 (11, 14, 34, 64) 97 ()
    (1, 3, 5, 11, 14, 34, 64) 97 ()
    (1, 3, 5, 11, 14, 34, 64, 97)

  Analysis: In each recursive step, we expect to partition the items into two
  equally sized sets.  In this case, there would be log n levels, and since
  the partitioning at each level takes n time in total, the expected running
  time is O(n log n).  However, we chose the last item as the pivot.  In the
  case of an already sorted list, in each step one of our partitions would be
  empty, so we would require n levels to do the sort.  So the worst case
  running time is O(n^2).  Choosing a random pivot makes it more likely that
  sorting an arbitrary list will be closer to the expected time rather than
  the worst case time.

- Modifications: If we want to deal with duplicate elements, we can modify
  the above sorts.  In the case of quick sort, we can keep an "equal"
  partition rather than just a single pivot.  With a change as trivial as
  converting a < or a > to a <= or a >=, we can make the other sorts work on
  duplicate elements as well.  See the quick sort sketch below.
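- In Java, quick sort on a linked list might look roughly like this (just a
  sketch, not code from lecture or lab; it uses java.util.LinkedList, takes
  the last element as the pivot, and keeps an "equal" partition as described
  in the modification above, so duplicates are handled):

      // Sketch of quick sort on a LinkedList<Integer>.
      public static LinkedList<Integer> quickSort(LinkedList<Integer> x) {
          if (x.size() <= 1) {
              return x;
          }
          int pivot = x.getLast();                         // last element as pivot
          LinkedList<Integer> y = new LinkedList<Integer>();      // < pivot
          LinkedList<Integer> equal = new LinkedList<Integer>();  // == pivot
          LinkedList<Integer> z = new LinkedList<Integer>();      // > pivot
          for (int item : x) {
              if (item < pivot) {
                  y.add(item);
              } else if (item > pivot) {
                  z.add(item);
              } else {
                  equal.add(item);
              }
          }
          // Result is (sorted y) + equal partition + (sorted z).
          LinkedList<Integer> result = quickSort(y);
          result.addAll(equal);
          result.addAll(quickSort(z));
          return result;
      }

  For example, quick sorting the list (14, 34, 3, 64, 1, 11, 5, 97) this way
  returns (1, 3, 5, 11, 14, 34, 64, 97).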