Administrative info
  HW6 due tomorrow
  MT2 next Tuesday
    Same location and policies as MT1
    Covers material through polling/LLN (Wednesday)

Review
  We have already seen one important distribution, the binomial
  distribution. A random variable X ~ Bin(n, p) has the distribution
    Pr[X = i] = C(n, i) p^i (1-p)^(n-i)
  for integer i, 0 <= i <= n. This distribution arises whenever we
  have a fixed number of trials n, the trials are mutually
  independent, the probability of success of any one trial is p, and
  we are counting the number of successes.

  We also computed the expectation of a binomial distribution using
  indicator random variables:
    X = X_1 + ... + X_n
    X_i = { 1 if the ith trial is successful
          { 0 otherwise
    E(X_i) = Pr[X_i = 1] = p
    E(X) = E(X_1) + ... + E(X_n)
      = np.
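
  As a quick sanity check, here is a small Python sketch that
  estimates E(X) for X ~ Bin(n, p) by simulation and compares it with
  np. (The values n = 20, p = 0.3, and the number of trials are
  arbitrary choices.)

    import random

    def sample_binomial(n, p):
        # Count successes in n independent trials, each with success prob. p.
        return sum(1 for _ in range(n) if random.random() < p)

    n, p, trials = 20, 0.3, 100_000
    avg = sum(sample_binomial(n, p) for _ in range(trials)) / trials
    print(avg, n * p)   # the empirical mean should be close to np = 6.0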

  Now we turn our attention to two more important discrete
  distributions.

Geometric Distribution
  Suppose I take a written driver's license test. Since I don't study,
  I only have a probability p of passing the test, mostly by getting
  lucky. Let T be the number of times I have to take the test before I
  pass. (Assume I can take it as many times as necessary, perhaps by
  paying a non-negligible fee.) What is the distribution of T?

  (Fun fact: A South Korean woman took the test 950 times before
  passing.)
  [Note: By "before passing," we mean that she passed in the 950th
  attempt, not the 951st. We may use this phrase again.]

  Before we determine the distribution of T, we should figure out what
  the sample space of the experiment is. An outcome consists of a
  series of 0 or more failures followed by a success, since I keep
  retaking the test until I pass it. Thus, if f is a failure and c is
  passing, we get the outcomes
    Ω = {c, fc, ffc, fffc, ffffc, ...}.
  How many outcomes are there? There is no upper bound on how many
  times I will have to take the test, since I can get very unlucky and
  keep failing. So the number of outcomes is infinite!

  What is the probability of each outcome? Well, let's assume that the
  result of a test is independent each time I take it. (I really
  haven't studied, so I'm just guessing blindly each time.) Then the
  probability of passing a test is p and of failing is 1-p, so we get
    Pr[c] = p, Pr[fc] = (1-p)p, Pr[ffc] = (1-p)^2 p, ...
  Do these probabilities add to 1? Well, their sum is
    ∑_{ω ∈ Ω} Pr[ω]
      = ∑_{i=0}^∞ (1-p)^i p
      = p ∑_{i=0}^∞ (1-p)^i
      = p 1/(1-(1-p))
                    (sum of geom. series r^i is 1/(1-r) if -1 < r < 1)
      = 1.
  So this probability assignment is valid.

  Since the event T = i has only the single outcome f^{i-1}c, we get
    Pr[T=1] = p, Pr[T=2] = (1-p)p, Pr[T=3] = (1-p)^2 p, ...
  as the distribution of T, and the probabilities sum to 1, as
  required for a random variable.

  The distribution of T is known as a "geometric distribution" with
  parameter p, T ~ Geom(p). This arises anytime we have a sequence of
  independent trials, each of which has probability p of success, and
  we want to know when the first success occurs. (This is unlike the
  binomial distribution, where we wanted to know how many successes occur
  in a fixed number n of independent trials.)
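
  To see this distribution concretely, here is a small Python
  simulation sketch that samples T repeatedly and compares the
  empirical frequency of T = i with (1-p)^(i-1) p. (The choices p =
  0.3 and 100,000 samples are arbitrary.)

    import random
    from collections import Counter

    def sample_geometric(p):
        # Number of independent trials (success prob. p), up to and
        # including the first success.
        t = 1
        while random.random() >= p:
            t += 1
        return t

    p, samples = 0.3, 100_000
    counts = Counter(sample_geometric(p) for _ in range(samples))
    for i in range(1, 6):
        print(i, counts[i] / samples, (1 - p) ** (i - 1) * p)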

  Now how many times can I expect to take the test before passing? We
  want E(T). We get
    E(T) = p + 2(1-p)p + 3(1-p)^2 p + ...
      = ∑_{i=1}^∞ i (1-p)^{i-1} p.
  This isn't a pure geometric series, so directly computing the sum is
  harder.

  Let's use another method. It turns out that for any random variable
  X that only takes on values in N,
    E(X) = Pr[X >= 1] + Pr[X >= 2] + Pr[X >= 3] + ...
      = ∑_{i=1}^∞ Pr[X >= i].
  Proof:
    Let p_i = Pr[X = i]. Then by definition,
    E(X) = 0 p_0 + 1 p_1 + 2 p_2 + 3 p_3 + 4 p_4 + ...
      = p_1 +
       (p_2 + p_2) +
       (p_3 + p_3 + p_3) +
       (p_4 + p_4 + p_4 + p_4) +
        ...
      = (p_1 + p_2 + p_3 + p_4 + ...) +
        (p_2 + p_3 + p_4 + ...) +
        (p_3 + p_4 + ...) +
         ...                   (combining columns from previous step)
      = Pr[X >= 1] + Pr[X >= 2] + Pr[X >= 3] + Pr[X >= 4] + ...
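
  Here is a tiny Python check of this identity on a made-up finite
  distribution (any nonnegative probabilities summing to 1 would do):

    # probs[i] = Pr[X = i] for an arbitrary distribution on {0, ..., 4}.
    probs = [0.1, 0.2, 0.3, 0.25, 0.15]

    expectation = sum(i * p for i, p in enumerate(probs))
    tail_sum = sum(sum(probs[i:]) for i in range(1, len(probs)))
    print(expectation, tail_sum)   # both are 2.15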

  Now what is Pr[T >= i]? This is the probability that I fail the
  first i-1 tests, so
    Pr[T >= i] = (1-p)^(i-1).
  Then
    E(T) = ∑_{i=1}^∞ Pr[T >= i]
      = ∑_{i=1}^∞ (1-p)^(i-1)
      = ∑_{j=0}^∞ (1-p)^j                 (with j = i - 1)
      = 1/(1-(1-p))                              (geometric series)
      = 1/p.

  So I expect to take the test 1/p times before passing.
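
  We can also check this numerically by truncating the series for
  E(T) directly; the following Python snippet does so for the
  arbitrary choice p = 0.3.

    # Truncate E(T) = sum_{i>=1} i (1-p)^(i-1) p and compare with 1/p.
    # 1000 terms is plenty, since (1-p)^i decays geometrically.
    p = 0.3
    series = sum(i * (1 - p) ** (i - 1) * p for i in range(1, 1001))
    print(series, 1 / p)   # both are (essentially) 3.333...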

  Here's another way to calculate E(T).
         E(T) = p + 2p(1-p) + 3p(1-p)^2 + 4p(1-p)^3 + ...
    (1-p)E(T) =      p(1-p) + 2p(1-p)^2 + 3p(1-p)^3 + ...
        pE(T) = p +  p(1-p) +  p(1-p)^2 +  p(1-p)^3 + ...
              = 1
         E(T) = 1/p
  In the second line, we multiplied E(T) by (1-p) and added some
  whitespace to line up terms with the previous line. Then we
  subtracted (1-p)E(T) from E(T) to get the third line. The resulting
  right-hand side is the sum of the probabilities of each event T = i,
  so it must be 1.

  To summarize, for a random variable X ~ Geom(p), we've computed
  (1) Pr[X = i] = (1-p)^(i-1) p
  (2) Pr[X >= i] = (1-p)^(i-1)
  (3) E(X) = 1/p.

  Other examples of geometrically distributed random variables are the
  number of runs before a system fails, the number of shots that must
  be taken before hitting a target, and the number of coin flips
  before heads appears.

Coupon Collector Redux
  Recall the coupon collector problem. We buy cereal boxes, each of
  which contains a baseball card for one of the n Giants players. How
  many boxes do I expect to buy before I get a Panda card?

  Let P be the number of boxes I buy before I get the Panda. Then,
  each time I buy a box, I have 1/n chance of getting the Panda, and
  the boxes are independent. So P ~ Geom(1/n), and E(P) = n.

  Now what if I want the entire team? Let T be the number of boxes I
  buy to get the entire team. It is tempting to define a separate random
  variable for each player,
    P = # of boxes to get the Panda
    B = # of boxes to get the Beard
    F = # of boxes to get the Freak
    ...
  but T ≠ P + B + F + ... (Can you see why? If we just consider
  these three players and it takes me 1 box to get the Panda, 2 to get
  the Beard, 3 to get the Freak, then T = 3, but P + B + F = 1 + 2 + 3
  = 6.) So we need another approach.

  Let's instead define random variables P_i as the number of boxes it
  takes to get a new player after I get the (i-1)th player. (In
  the above example, P_1 = P_2 = P_3 = 1, so T = P_1 + P_2 + P_3.)
  Then it is the case that T = P_1 + ... + P_n, and we can appeal to
  linearity of expectation.

  Now E(P_i) is not constant for all i. In particular, I always get a
  new player in the first box, so Pr[P_1 = 1] = 1 and E(P_1) = 1. But
  then for the second box, I can get the same player as the first, so
  Pr[P_2 = 1] ≠ 1.

  Note, however, that I do have probability (n-1)/n of getting a new
  player, and P_2 is the first occurrence of a new player. So P_2 ~
  Geom((n-1)/n), and E(P_2) = n/(n-1).

  By the same reasoning, P_i ~ Geom((n-i+1)/n), so E(P_i) = n/(n-i+1).
  So by linearity of expectation,
    E(T) = n/n + n/(n-1) + n/(n-2) + ... + n/2 + n/1
      = n ∑_{i=1}^n 1/i.
  The above sum has a good approximation
    ∑_{i=1}^n 1/i ≈ ln(n) + γ,
  where γ ≈ 0.5772 is Euler's constant. So we get
    E(T) ≈ n(ln(n) + 0.58).
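
  As an illustration, here is a small Python simulation sketch that
  buys boxes with uniformly random cards until all n are collected
  and compares the average with n ∑ 1/i and n(ln(n) + γ). (The
  choices n = 25 and 10,000 trials are arbitrary.)

    import math
    import random

    def boxes_to_collect_all(n):
        # Buy boxes (uniformly random cards) until all n distinct cards appear.
        seen, boxes = set(), 0
        while len(seen) < n:
            seen.add(random.randrange(n))
            boxes += 1
        return boxes

    n, trials = 25, 10_000
    avg = sum(boxes_to_collect_all(n) for _ in range(trials)) / trials
    harmonic = n * sum(1 / i for i in range(1, n + 1))
    print(avg, harmonic, n * (math.log(n) + 0.5772))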

  Recall our previous result, where we computed that in order to have a
  50% chance of getting all n cards, we needed to buy n ln(2n) =
  n(ln(n) + ln(2)) ≈ n(ln(n) + 0.69) boxes.

  It is not the case in general that Pr[X > E(X)] ≈ 1/2. The
  simplest counterexample is an indicator random variable Y, Pr[Y =
  1] = p. Then E(Y) = p, so Pr[Y > E(Y)] = p ≠ 1/2. So the two
  results for coupon collecting are not directly comparable.

Poisson Distribution
  Suppose we throw n balls into n/λ bins, where n is large and
  λ is a constant. We are interested in how many balls land in
  bin 1. Call this X, then X ~ Bin(n, λ/n), and E(X) =
  λ. In more detail, the distribution is
    Pr[X = i] = C(n, i) (λ/n)^i (1 - λ/n)^(n-i),
  for 0 <= i <= n.

  We know n is large, so let's approximate this distribution. Let's
  define p_i ≡ Pr[X = i]. Then we have
    p_0 = Pr[X = 0] = (1 - λ/n)^n.
  Recall the Taylor series for e^x:
    e^x = 1 + x + x^2/2! + x^3/3! + ...
  Plugging in x = -y, we get e^{-y} ≈ 1 - y for small y, so
    (1 - λ/n) ≈ e^{-λ/n}
    (1 - λ/n)^n ≈ (e^{-λ/n})^n
      = e^{-λ}.
  Thus, p_0 ≈ e^{-λ}.

  What about p_i in the general case? Let's look at the ratio
  p_i/p_{i-1}.
    p_i/p_{i-1} = [C(n,i) (λ/n)^i (1-λ/n)^{n-i}]/
                  [C(n,i-1) (λ/n)^{i-1} (1-λ/n)^{n-i+1}]
      = [C(n,i) λ/n]/
        [C(n,i-1) (1-λ/n)]
      = [C(n,i) λ/n]/
        [C(n,i-1) (n-λ)/n]
      = C(n,i)/C(n,i-1) λ/(n-λ).
  Now let's look at the ratio C(n,i)/C(n,i-1). We have
    C(n,i)/C(n,i-1) = (n!/[i!(n-i)!])/
                      (n!/[(i-1)!(n-i+1)!])
      = (i-1)!/i! (n-i+1)!/(n-i)!
      = 1/i (n-i+1)
      = (n-i+1)/i.
  Plugging this in to our expression for p_i/p_{i-1}, we get
    p_i/p_{i-1} = (n-i+1)/i λ/(n-λ)
      = (n-i+1)/(n-λ) λ/i.
  Now in the limit n -> ∞, (n-i+1)/(n-λ) -> 1, so
    p_i/p_{i-1} ≈ λ/i,
    p_i ≈ p_{i-1} λ/i.

  This gives us a recurrence:
    p_0 = exp(-λ)
    p_1 = exp(-λ) λ
    p_2 = exp(-λ) λ^2/2
    p_3 = exp(-λ) λ^3/(2*3)
    p_4 = exp(-λ) λ^4/(2*3*4)
    ...
    p_i = exp(-λ) λ^i/i!
  So we get a new distribution
    Pr[X = i] = (λ^i)/i! e^{-λ}, i ∈ N.
  This is a "Poisson distribution" with parameter λ, and we
  write X ~ Poiss(λ). (Note that though in the original
  binomial distribution, i is restricted to 0 <= i <= n, here it is
  not.)
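
  To get a feel for how good the approximation is, here is a short
  Python comparison of the Bin(n, λ/n) and Poiss(λ) probabilities.
  (The choices λ = 2 and n = 1000 are arbitrary.)

    import math

    def binom_pmf(n, p, i):
        # Pr[X = i] for X ~ Bin(n, p).
        return math.comb(n, i) * p ** i * (1 - p) ** (n - i)

    def poisson_pmf(lam, i):
        # Pr[X = i] for X ~ Poiss(lam).
        return lam ** i / math.factorial(i) * math.exp(-lam)

    lam, n = 2.0, 1000
    for i in range(6):
        print(i, binom_pmf(n, lam / n, i), poisson_pmf(lam, i))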

  Let's check to make sure this is a proper distribution. We have
    ∑_{i=0}^∞ p_i
      = ∑_{i=0}^∞ (λ^i)/i! e^{-λ}
      = e^{-λ} ∑_{i=0}^∞ (λ^i)/i!
      = e^{-λ} e^{λ}    (using the Taylor series above)
      = 1.

  Now let's compute E(X):
    E(X) = ∑_{i=0}^∞ i (λ^i)/i! e^{-λ}
      = e^{-λ} ∑_{i=0}^∞ i (λ^i)/i!
      = e^{-λ} ∑_{i=1}^∞ i (λ^i)/i!
      = e^{-λ} ∑_{i=1}^∞ (λ^i)/(i-1)!
      = λ e^{-λ}
          ∑_{i=1}^∞ (λ^(i-1))/(i-1)!
      = λ e^{-λ}
          ∑_{j=0}^∞ (λ^j)/j!        (with j = i - 1)
      = λ e^{-λ} e^{λ}
      = λ.
  This is the same as that of the original binomial distribution.
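
  Both facts are easy to check numerically as well; this Python
  snippet truncates the two sums for the arbitrary choice λ = 2.

    import math

    lam = 2.0
    # Truncated pmf; terms beyond i = 50 are negligible for lam = 2.
    pmf = [lam ** i / math.factorial(i) * math.exp(-lam) for i in range(50)]
    print(sum(pmf))                               # essentially 1
    print(sum(i * p for i, p in enumerate(pmf)))  # essentially lam = 2.0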

  The Poisson distribution is widely used for modeling rare events. It
  is a good approximation of the binomial distribution when n >= 20
  and p <= 0.05, and a very good approximation when n >= 100 and np
  <= 10.