Administrative info
  PA3 out in the next day or so

Review
  Recall the coin flipping game from last time. We flip a biased coin
  that has probability p of heads n times, and for each heads we win
  $1 and we lose $1 for each tails. We are interested in how much
  total money we win.

  In general, we determined that if we flip the coin n times, then
  W(ω) = 2 H(ω) - n, where H(ω) is the number of
  heads in ω. We abbreviate this statement as W = 2H - n.

  We then demonstrated that the distribution of H is
    Pr[H = i] = C(n, i) p^i (1-p)^(n-i)
  for integer i, 0 <= i <= n. This is a binomial distribution with
  parameters n and p, denoted by H ~ Bin(n, p).
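
  We can sanity-check this pmf with a short Python sketch (the values
  n = 3 and p = 0.6 below are arbitrary choices for illustration): it
  tabulates Pr[H = i] using math.comb and confirms the probabilities
  sum to 1.
    from math import comb

    n, p = 3, 0.6   # arbitrary example parameters
    pmf = {i: comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)}
    print(pmf)                 # Pr[H = i] for each i
    print(sum(pmf.values()))   # should be 1.0 (up to rounding)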

  Once we computed the distribution of H, we computed the distribution
  of W = 2H - n. We have
    Pr[W = j] = Pr[2H - n = j]
      = Pr[H = (j+n)/2]
      = C(n, (j+n)/2) p^[(j+n)/2] (1-p)^[n-(j+n)/2]
  for integer (j+n)/2, 0 <= (j+n)/2 <= n. Solving for j, we get -n
  <= j <= n as the range of values W can take on, with the caveat
  that W takes on only even values if n is even and only odd values
  if n is odd, so that (j+n)/2 is an integer.
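
  To see this change of variables concretely, the sketch above can be
  relabeled: each value i of H becomes the value j = 2i - n of W, so
  only values of j with the same parity as n appear (again, n and p
  are arbitrary).
    from math import comb

    n, p = 3, 0.6
    # Pr[W = j] = Pr[H = (j+n)/2], i.e., relabel each i as j = 2i - n.
    w_pmf = {2*i - n: comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)}
    print(w_pmf)   # keys are j = -n, -n+2, ..., n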

  Recall the exam example from last time. We had n students, each of
  whom receives a random exam back. We were interested in how many
  students get their own exam back. Calling this random variable X, we
  then defined X_i to be an indicator random variable that is 1 if the
  ith student gets his or her own exam back and 0 otherwise. Then X =
  X_1 + ... + X_n.

  We defined the expected value of a random variable X to
  be
    E(X) = ∑_{ω ∈ Ω} X(ω) Pr[ω],
  or equivalently
    E(X) = ∑_{a ∈ A} a * Pr[X = a],
  where A is the set of all values that X can take on.
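
  The two forms of the definition give the same number; here is a
  small sketch that checks this on a made-up sample space and random
  variable (the outcomes, probabilities, and values are arbitrary).
    from collections import defaultdict

    prob = {'a': 0.2, 'b': 0.3, 'c': 0.5}    # outcome -> probability
    X = {'a': 1, 'b': 1, 'c': 4}             # outcome -> value of X

    # First form: sum over outcomes ω of X(ω) Pr[ω].
    e1 = sum(X[w] * prob[w] for w in prob)

    # Second form: sum over values a of a * Pr[X = a].
    pX = defaultdict(float)
    for w in prob:
        pX[X[w]] += prob[w]
    e2 = sum(a * pr for a, pr in pX.items())

    print(e1, e2)   # both 2.5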

  We determined that in the coin flipping example, E(W) = 6p - 3 when
  n = 3.

  In the example of passing back exams, for n = 3, we had
    E(X) = 3 Pr[X=3] + 1 Pr[X=1] + 0 Pr[X=0]
      = 3 * 1/6 + 1 * 1/2 + 0 * 1/3
      = 1.
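
  For n = 3 we can verify these numbers by brute force, enumerating
  all 3! = 6 equally likely ways of handing the exams back and
  counting how many students get their own exam (a sketch using
  itertools).
    from itertools import permutations

    n = 3
    outcomes = list(permutations(range(n)))   # all 6 equally likely assignments
    fixed = [sum(perm[i] == i for i in range(n)) for perm in outcomes]
    # Distribution of X: Pr[X=3] = 1/6, Pr[X=1] = 3/6, Pr[X=0] = 2/6.
    print({k: fixed.count(k) / len(outcomes) for k in sorted(set(fixed))})
    print(sum(fixed) / len(outcomes))          # E(X) = 1.0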

  Suppose we roll a fair die. We calculated the expected value of N, the
  number that shows, as
    Pr[N=i] = 1/6 for 1 <= i <= 6, and
    E(N) = 1 Pr[N=1] + 2 Pr[N=2] + ... + 6 Pr[N=6]
      = 1 * 1/6 + 2 * 1/6 + ... + 6 * 1/6
      = 21/6 = 7/2.

  We then computed in a tedious manner that if we roll two dice, the
  expected value of their sum S is E(S) = 7.
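
  That computation is easy to replicate by brute force over the 36
  equally likely outcomes (a sketch):
    # All 36 equally likely outcomes for two fair dice.
    outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
    E_S = sum(d1 + d2 for d1, d2 in outcomes) / len(outcomes)
    print(E_S)   # 7.0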

  As a final example, suppose we pick 100 Californians uniformly at
  random. How many Democrats do we expect out of this group, given
  that 44.5% of Californians are Democrats? Intuitively, we'd expect
  44.5, but how can we arrive at that without computing a large
  distribution?

Linearity of Expectation
  Suppose we have two random variables X and Y. Let Z = X + Y. What is
  E(Z)? We have, from the first definition of expectation,
    E(Z) = ∑_{ω ∈ Ω} Z(ω) Pr[ω]
      = ∑_{ω} (X(ω) + Y(ω)) Pr[ω]
      = ∑_{ω} X(ω) Pr[ω] +
        ∑_{ω} Y(ω) Pr[ω]
      = E(X) + E(Y).
  Thus, E(X+Y) = E(X) + E(Y). We can similarly show that E(cX) =
  cE(X), where c is a constant. These two facts are known as
  "linearity of expectation."

  Linearity of expectation is a powerful tool for computing
  expectations. We have already seen examples of defining a random
  variable in terms of other random variables, which allows us to use
  linearity of expectation.

  Let's go back to the example of rolling two dice. Let N_1 be the
  value of the first die, N_2 the value of the second die, and S = N_1
  + N_2 the sum. We already computed E(N_1) = E(N_2) = 7/2, so E(S) =
  E(N_1) + E(N_2) = 7, as before. This computation, however, is much
  simpler than using the distribution of S.

  In the exam example, let us compute the distribution of X_i, which
  is 1 if the ith student gets his or her own exam back and 0
  otherwise. There are n choices of exam, only one of which is a
  match, so
    Pr[X_i=1] = 1/n
    Pr[X_i=0] = 1 - 1/n.
  What is E(X_i)? It is
    E(X_i) = Pr[X_i=1] = 1/n.

  Note that in general, for an indicator random variable Y,
    E(Y) = 1 * Pr[Y=1] + 0 * Pr[Y=0] = Pr[Y=1].

  Now let us compute E(X), where X is the total number of students
  who get their own exam back. Since X = X_1 + ... + X_n, we have
    E(X) = E(X_1) + ... + E(X_n)
      = 1/n + ... + 1/n
      = 1.
  This matches what we get when n = 3. Notice that the expected number
  of students who get their own exam back is always 1, regardless of
  n! This is quite surprising.
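
  A simulation sketch backs this up for larger n (the number of trials
  below is arbitrary):
    import random

    def avg_fixed_points(n, trials=100_000):
        # Average number of students who get their own exam back,
        # over `trials` random ways of handing the exams out.
        total = 0
        for _ in range(trials):
            perm = list(range(n))
            random.shuffle(perm)
            total += sum(perm[i] == i for i in range(n))
        return total / trials

    print(avg_fixed_points(3), avg_fixed_points(100))   # both close to 1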

  Let us proceed in the same manner to compute E(H), the expected
  number of heads when flipping a biased coin n times. Let H_i be an
  indicator random variable that is 1 if the ith flip is heads, 0
  otherwise. Then Pr[H_i = 1] = p, so E(H_i) = p. The number of heads
  is just
    H = H_1 + ... + H_n,
  so
    E(H) = E(H_1) + ... + E(H_n)
      = p + ... + p
      = np.
  Again, this is much simpler than using the distribution of H.

  In general, for a random variable X ~ Bin(n, p), we have E(X) = np.
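
  As a quick check of this fact, we can compare the sum
  ∑_i i * Pr[X = i] against np directly (a sketch with arbitrary n
  and p):
    from math import comb

    n, p = 10, 0.3
    E_X = sum(i * comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1))
    print(E_X, n * p)   # both 3.0 (up to rounding)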

  In our coin flipping game, our winnings W were given by W = 2H - n.
  Thus,
    E(W) = 2 E(H) - n
      = 2np - n
      = n (2p - 1).
  (Note that the expectation of a constant E(c) is just c, so E(n) =
  n.) Plugging in n = 3, we get E(W) = 6p - 3, as before. Again, this
  method is much easier than using the distribution of W.

  Finally, how many Democrats do we expect in a group of 100 random
  Californians? Let D be the number of Democrats and D_i an indicator
  random variable that is 1 if the ith person is a Democrat and 0
  otherwise. Then Pr[D_i=1] = 0.445, so E(D_i) = 0.445, and E(D) =
  E(D_1) + ... + E(D_100) = 44.5, as we expected.

  We could have also noticed that D ~ Bin(100, 0.445) and immediately
  concluded that E(D) = 100 * 0.445 = 44.5.

Geometric Distribution
  We have already seen one important distribution, the binomial
  distribution. We will look at two more important discrete
  distributions.

  Suppose I take a written driver's license test. Since I don't study,
  I only have a probability p of passing the test, mostly by getting
  lucky. Let T be the number of times I have to take the test before I
  pass. (Assume I can take it as many times as necessary, perhaps by
  paying a non-negligible fee.) What is the distribution of T?

  (Fun fact: A South Korean woman took the test 950 times before
  passing.)
  [Note: By "before passing," we mean that she passed in the 950th
  attempt, not the 951st. We may use this phrase again.]

  Before we determine the distribution of T, we should figure out what
  the sample space of the experiment is. An outcome consists of a
  series of 0 or more failures followed by a success, since I keep
  retaking the test until I pass it. Thus, if f is a failure and c is
  passing, we get the outcomes
    Ω = {c, fc, ffc, fffc, ffffc, ...}.
  How many outcomes are there? There is no upper bound on how many
  times I will have to take the test, since I can get very unlucky and
  keep failing. So the number of outcomes is infinite!

  What is the probability of each outcome? Well, let's assume that the
  result of a test is independent each time I take it. (I really
  haven't studied, so I'm just guessing blindly each time.) Then the
  probability of passing a test is p and of failing is 1-p, so we get
    Pr[c] = p, Pr[fc] = (1-p)p, Pr[ffc] = (1-p)^2 p, ...
  Do these probabilities add to 1? Well, their sum is
    ∑_{ω ∈ &Omega} Pr[ω]
      = ∑_{i=0}^∞ (1-p)^i p
      = p ∑_{i=0}^∞ (1-p)^i
      = p 1/(1-(1-p))   [sum of geom. series r^i is 1/r if -1 < r < 1]
      = 1.
  So this probability assignment is valid.
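
  A short sketch checks this numerically, both by summing the first
  many terms of the series and by simulating the test-taking process
  (the value p = 0.3, the cutoff of 1000 terms, and the number of
  trials are all arbitrary):
    import random
    from collections import Counter

    p = 0.3
    # Pr[c] + Pr[fc] + Pr[ffc] + ... ; the first 1000 terms carry
    # essentially all of the mass.
    print(sum((1 - p)**i * p for i in range(1000)))   # very close to 1

    def attempts_until_pass(p):
        # Keep taking independent tests until the first pass; return
        # the total number of attempts, including the passing one.
        t = 1
        while random.random() >= p:
            t += 1
        return t

    counts = Counter(attempts_until_pass(p) for _ in range(100_000))
    # Empirical Pr[T = k] should be close to (1-p)^(k-1) p.
    for k in range(1, 5):
        print(k, counts[k] / 100_000, (1 - p)**(k - 1) * p)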

  We continue with this example next time.