Administrative info
  HW7 out, due Monday
  MT2 next Tuesday
    Same location and policies as MT1
    Covers material through polling/LLN (Wednesday)

Review
  We have now seen three important distributions. The first is the
  binomial distribution. A random variable X ~ Bin(n, p) has the
  distribution
    Pr[X = i] = C(n, i) p^i (1-p)^(n-i)
  for integer i, 0 <= i <= n. This distribution arises whenever we
  have a fixed number of trials n, the trials are mutually
  independent, the probability of success of any one trial is p, and
  we are counting the number of successes. The expectation of X is
    E(X) = np.

  The second is the geometric distribution. A random variable Y ~
  Geom(p) has the distribution
    Pr[Y = i] = p(1-p)^(i-1)
  for i ∈ Z^+. This distribution arises whenever we have
  independent trials, the probability of success of any one trial is
  p, and we are interested in the first success. The expectation of Y
  is
    E(Y) = 1/p.

  The third is the Poisson distribution. A random variable Z ~
  Poiss(λ) has the distribution
    Pr[Z = i] = (λ^i / i!) e^{-λ}
  for i ∈ N. This distribution is the limit of the binomial
  distribution when n is large and p is small. It is used to model the
  occurrence of rare events. The expectation of Z is
    E(Z) = λ.
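
  These formulas are easy to check numerically. Below is a short
  Python sketch (the parameter values are arbitrary choices, not from
  the lecture) that sums each pmf against i and confirms the stated
  expectations; the geometric and Poisson sums are truncated at a
  large cutoff, which is safe because the tails are negligible.

    from math import comb, exp, factorial

    n, p, lam = 10, 0.3, 2.0   # arbitrary example parameters

    # Binomial: sum_i i * C(n,i) p^i (1-p)^(n-i) should equal n*p.
    E_bin = sum(i * comb(n, i) * p**i * (1 - p)**(n - i)
                for i in range(n + 1))

    # Geometric: sum_i i * p(1-p)^(i-1) should approach 1/p.
    E_geom = sum(i * p * (1 - p)**(i - 1) for i in range(1, 1000))

    # Poisson: sum_i i * (lam^i / i!) e^(-lam) should approach lam.
    E_poiss = sum(i * lam**i / factorial(i) * exp(-lam)
                  for i in range(100))

    print(E_bin, n * p)    # 3.0       3.0
    print(E_geom, 1 / p)   # 3.333...  3.333...
    print(E_poiss, lam)    # 2.0       2.0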

Poisson Distribution
  The Poisson distribution is widely used for modeling rare events. It
  is a good approximation of the binomial distribution when n >= 20
  and p <= 0.05, and a very good approximation when n >= 100 and np
  <= 10.

  EX: Suppose a web server gets an average of 100K requests a day.
      Each request takes 1 second to handle. How many servers are
      needed to handle requests?
  ANS: The website has an unknown number of customers n, and there is
       a tiny probability p of each person making a request in any 1
       second time period. Thus, the rare event is a person choosing
       to make a request, and we can use the Poisson distribution to
       model this situation. (We don't actually know n or p, so we
       couldn't use the binomial distribution even if we wanted to.)

       Since there are 100K requests a day on average, the average
       number of requests in a 1 second time period is
         λ = 100000/(24*3600) ≈ 1.2.
       Unlike n and p, this can be measured directly, allowing us to
       use the Poisson distribution. Let R be the number of requests
       in a 1 second period. Then R ~ Poiss(1.2), and Pr[R = i] =
        (λ^i / i!) e^{-λ}.

       Plugging in λ = 1.2, we get the following values:
         i     Pr[R = i]     Pr[R <= i]
         0       0.301          0.301
         1       0.361          0.662
         2       0.217          0.879
         3       0.087          0.966
         4       0.026          0.992
         5       0.006          0.999.
       So if we have 5 servers, we can handle all requests without
       overloading the servers 99.9% of the time.

       (Note that we assumed a uniform distribution of requests over
       the entire day. If this is not the case, we can measure the
       average number of requests in the busiest 1 second time period
       and use this as λ. The rest of our analysis will be the
       same.)
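
  (The table above is easy to reproduce, up to rounding. Here is a
  rough Python sketch that uses the rounded value λ = 1.2 from the
  calculation above and evaluates the Poisson pmf and its running
  total for i = 0..5.)

    from math import exp, factorial

    lam = 1.2                   # rounded average requests per second
    total = 0.0
    for i in range(6):
        pr = lam**i / factorial(i) * exp(-lam)   # Pr[R = i]
        total += pr                              # Pr[R <= i]
        print(i, round(pr, 3), round(total, 3))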

Variance
  Consider a random walk: I flip a (fair) coin, and if it is heads, I
  take a step to the right, but if it is tails, I take a step to the
  left. (This models many situations: a drunken sailor, the value of
  my stock account, our coin flipping game from a previous lecture.)
  How far from the starting point can I expect to be after n flips?

  Let X_i be a random variable (not an indicator r.v.!) that is +1 if
  the ith flip is heads, -1 if it is tails. Let Y be my position after
  n flips. Then
    X_i = {1 with pr. 1/2, -1 with pr. 1/2}
    Y = X_1 + ... + X_n
  What is E(Y)? We have
    E(X_i) = 0
    E(Y) = E(X_1) + ... + E(X_n)
      = 0.
  So I can expect to be back where I started.

  This isn't, however, exactly what the question asked. We wanted to
  know our distance from the starting point, not where we end
  up. What we actually want to know is E(|Y|).

  Unfortunately, the random variable |Y| is difficult to work with.
  So let's work with Y^2 instead, which will always be positive. Then
  we will take a square root at the end to learn something about how
  far we typically are from the starting point.

  (Note that it is not true in general that √{E(Z^2)} = E(|Z|). As a
  simple counterexample, consider an indicator random variable Z with
  Pr[Z = 1] = p. Then E(|Z|) = E(Z) = p, but E(Z^2) = p, so
  √{E(Z^2)} = √{p}, which differs from E(|Z|) = p whenever 0 < p < 1.
  We will see later how to relate |Z| and Z^2.)

  We have
    E(Y^2) = E((X_1 + ... + X_n)^2)
      = E(∑_{i,j} X_i X_j)
      = ∑_{i,j} E(X_i X_j).
  In the above summations, i,j are in the range 1 <= i,j <= n, so
  there are n^2 terms.

  What is E(X_i X_j)? There are two cases:
  (1) i = j
      Then E(X_i X_j) = E(X_i^2) = 1, since X_i^2 is always 1.
  (2) i ≠ j
      Let's enumerate the possibilities for X_i X_j:
        X_i    X_j    X_i X_j    prob.
         1      1        1        1/4
         1     -1       -1        1/4
        -1      1       -1        1/4
        -1     -1        1        1/4
      In the last column, we used the fact that different coin flips
      are independent, so the events X_i = a and X_j = b are
      independent.

      Putting this together, we get
        Pr[X_i X_j = 1] = 1/2
        Pr[X_i X_j = -1] = 1/2
        E(X_i X_j) = 0.

  In our summation, there are n terms that fall under case (1) and
  n^2 - n that fall under case (2), so we get
    E(Y^2) = n * 1 + (n^2 - n) * 0
      = n.
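
  A quick simulation supports this. The Python sketch below (the
  sample sizes are arbitrary) averages Y^2 over many simulated walks
  and should print a value close to n.

    import random

    n, trials = 100, 100_000
    total = 0.0
    for _ in range(trials):
        # position after n fair +1/-1 steps
        y = sum(random.choice((-1, 1)) for _ in range(n))
        total += y * y
    print(total / trials)   # empirical E(Y^2); should be close to n = 100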

  This is called the "variance" of Y, and it tells us something about
  the spread of the random variable Y.

  More generally, for a random variable Z with arbitrary expectation
  E(Z) = μ, we define the variance to be
    Var(Z) = E((Z - μ)^2).
  It tells us something about the deviation of Z from its mean.

  The "standard deviation" of Z is
    σ(Z) = √{Var(Z)},
  which in some sense undoes the square in the variance.

  (Why do we have both variance and standard deviation? Variance is
  easier to work with, but standard deviation is on the same scale as
  the random variable, so it gives us a better idea about the typical
  deviation from the mean.)

  In the random walk, σ(Y) = √{n}.

  An alternative expression for variance is
    Var(X) = E(X^2) - μ^2.
  Proof:
    Var(X) = E((X - μ)^2)
      = E(X^2 - 2Xμ + μ^2)
      = E(X^2) - 2μE(X) + μ^2
      = E(X^2) - 2μ^2 + μ^2
      = E(X^2) - μ^2.
  In the third step above, we used linearity of expectation.
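
  As a concrete check that the two expressions agree, here is a tiny
  Python sketch (the distribution is an arbitrary choice) that
  computes E((X - μ)^2) and E(X^2) - μ^2 for X uniform on {1, 2, 3};
  both come out to 2/3.

    vals = [1, 2, 3]                                 # X uniform on {1, 2, 3}
    mu   = sum(vals) / 3
    lhs  = sum((v - mu)**2 for v in vals) / 3        # E((X - mu)^2)
    rhs  = sum(v * v for v in vals) / 3 - mu**2      # E(X^2) - mu^2
    print(lhs, rhs)                                  # both 0.666...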

  Let's do some more examples.

  Uniform distribution
    Let X be a random variable with uniform distribution in 1,...,n.
    Then
      μ = E(X) = 1/n (1 + ... + n) = 1/n n(n+1)/2 = (n+1)/2
      μ^2 = (n+1)^2/4 = 3(n+1)^2/12
      E(X^2) = 1/n (1 + 4 + ... + n^2)
        = 1/n ∑_{i=1}^n i^2
        = 1/n n(n+1)(2n+1)/6
        = (n+1)(2n+1)/6 = 2(n+1)(2n+1)/12
      Var(X) = E(X^2) - μ^2
        = 2(n+1)(2n+1)/12 - 3(n+1)^2/12
        = (n+1)/12 (4n+2 - 3n-3)
        = (n+1)(n-1)/12
        = (n^2-1)/12.

    Compare this variance to that of the random walk; this is on the
    order of n^2, while that of the random walk was on the order of n.
    This should make sense, since in the case of the random walk, it's
    much more likely to be closer to the mean than further, unlike in
    a uniform distribution. (The probability "mass" is concentrated
    near the mean, while in a uniform distribution, it is spread out.)

    EX: Let X be the result of a roll of a fair die. What is Var(X)?
    ANS: Var(X) = (6^2-1)/12 = 35/12
         σ(X) ≈ 1.7.
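
    (This is easy to confirm directly from the definition of variance;
    a minimal Python sketch follows.)

      faces = range(1, 7)                          # fair die
      mu    = sum(faces) / 6                       # 3.5
      var   = sum((f - mu)**2 for f in faces) / 6  # E((X - mu)^2)
      print(var, 35 / 12)                          # both 2.9166...
      print(var ** 0.5)                            # sigma ~ 1.7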

  Binomial distribution
    Let X ~ Bin(n, p). Then we proceed as in the random walk. Let X_i
    be an indicator random variable that is 1 if the ith trial
    succeeds. Then
      X = X_1 + ... + X_n
      E(X^2) = E(∑_{i,j} X_i X_j)
      = ∑_{i,j} E(X_i X_j).
    For E(X_i X_j), we have two cases.
    (1) i = j
        Then E(X_i X_j) = E(X_i^2) = p, since Pr[X_i^2 = 1] = p.
    (2) i ≠ j
        Let's enumerate the possibilities for X_i X_j:
          X_i    X_j    X_i X_j    prob.
           1      1        1        p^2
           1      0        0       p(1-p)
           0      1        0       p(1-p)
           0      0        0      (1-p)^2
        In the last column, we used the fact that different trials
        are independent. Thus, Pr[X_i X_j = 1] = p^2, and E(X_i X_j) =
        p^2.
    There are n terms in the summation that fall under case (1), n^2 -
    n that fall under case (2), so we get
      E(X^2) = np + (n^2-n)p^2
        = np + n^2 p^2 - np^2
        = n^2 p^2 + np(1-p).
    Then
      Var(X) = E(X^2) - E(X)^2
        = n^2 p^2 + np(1-p) - n^2 p^2
        = np(1-p).
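
    A quick numeric check of np(1-p), with arbitrary n and p (these
    values are not from the lecture): sum the exact pmf to get E(X)
    and E(X^2), then compare.

      from math import comb

      n, p = 20, 0.3
      pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
      EX  = sum(i * pr for i, pr in enumerate(pmf))
      EX2 = sum(i * i * pr for i, pr in enumerate(pmf))
      print(EX2 - EX**2, n * p * (1 - p))   # both 4.2 (up to float error)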

  Geometric distribution
    Let X ~ Geom(p). Then
           E(X^2) = p + 4p(1-p) + 9p(1-p)^2 + 16p(1-p)^3 + ...
    Multiplying this by (1-p), we get
      (1-p)E(X^2) =      p(1-p) + 4p(1-p)^2 +  9p(1-p)^3 + ...
    Subtracting, we get
          pE(X^2) = p + 3p(1-p) + 5p(1-p)^2 +  7p(1-p)^3 + ...
                = 2[p + 2p(1-p) + 3p(1-p)^2 +  4p(1-p)^3 + ...]
                 - [p +  p(1-p) +  p(1-p)^2 +   p(1-p)^3 + ...]
    The first sum is just E(X), and the second is the sum of the
    probabilities of each of the outcomes, so it is 1. Thus,
      pE(X^2) = 2E(X) - 1
        = 2/p - 1
        = (2-p)/p
      E(X^2) = (2-p)/p^2.
    Then
      Var(X) = E(X^2) - E(X)^2
        = (2-p)/p^2 - 1/p^2
        = (1-p)/p^2.
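
    Again this can be checked numerically; the sketch below (p = 0.25
    is an arbitrary choice) truncates the infinite sums at a large
    cutoff, which is safe because the tails are negligible.

      p = 0.25
      EX  = sum(i * p * (1 - p)**(i - 1) for i in range(1, 2000))
      EX2 = sum(i * i * p * (1 - p)**(i - 1) for i in range(1, 2000))
      print(EX2 - EX**2, (1 - p) / p**2)    # both ~ 12.0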

  Here are some useful facts about variance.
  (1) Var(cX) = c^2 Var(X), where c is a constant.
  (2) Var(X+c) = Var(X), where c is a constant.
      EX: In the random walk, we have Y = 2H - n, where H is the
          number of heads. (We showed this in a previous lecture.) So
          Var(Y) = 4Var(H). We've computed Var(H) = np(1-p), so Var(Y)
          = 4np(1-p). For a fair coin, p = 1/2, and we get Var(Y) =
          4n(1/2)(1/2) = n, as before.
  (3) Var(X+Y) = Var(X) + Var(Y) if X and Y are independent.
      What does it mean for two random variables X and Y to be
      independent? Let A be the set of values that X can take on, B be
      the set of values Y can take on. Then X and Y are independent if
        ∀a∈A ∀b∈B . Pr[X=a ∩ Y=b] = Pr[X=a] Pr[Y=b].
      EX: Let X ~ Bin(n, p) and define indicator random variables X_i
          as before. Then E(X_i^2) = p, so Var(X_i) = p - p^2 =
          p(1-p). Then Var(X) = Var(X_1) + ... + Var(X_n) = np(1-p),
          as before.
  The proofs of (1) and (2) are straightforward from the definition
  of variance. We will come back to (3) later.
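
  Fact (3) can also be checked empirically. The Python sketch below
  (the sample size and the choice of two independent die rolls are
  arbitrary) compares the sample variance of X + Y with the sum of
  the individual sample variances; both should be close to
  2 * 35/12 ≈ 5.83.

    import random

    def sample_var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m)**2 for x in xs) / len(xs)

    N  = 200_000
    xs = [random.randint(1, 6) for _ in range(N)]   # first die roll
    ys = [random.randint(1, 6) for _ in range(N)]   # second, independent roll
    print(sample_var([x + y for x, y in zip(xs, ys)]))
    print(sample_var(xs) + sample_var(ys))          # both ~ 5.83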