Administrative info
  HW7 out, due Monday
  MT2 next Tuesday
    Same location and policies as MT1
    Covers through polling/LLN (Wednesday)

Review
  Recall that the variance of a random variable Z with expectation
  E(Z) = μ is
    Var(Z) = E((Z - μ)^2).
  It tells us something about the deviation of Z from its mean.

  The "standard deviation" of Z is
    σ(Z) = √{Var(Z)},
  which in some sense undoes the square in the variance.

  An alternative expression for variance is
    Var(Z) = E(Z^2) - μ^2.

  For a binomial random variable X ~ Bin(n, p), Var(X) = np(1-p).

  For a geometric random variable Y ~ Geom(p), Var(Y) = (1-p)/p^2.

  For a Poisson random variable Z ~ Poiss(λ), Var(Z) = λ.

  Here are some useful facts about variance.
  (1) Var(cX) = c^2 Var(X), where c is a constant.
  (2) Var(X+c) = Var(X), where c is a constant.
  (3) Var(X+Y) = Var(X) + Var(Y) if X and Y are independent.
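
  As a rough sanity check of these facts (not part of the notes), here
  is a minimal Python sketch; the choice of Bin(10, 0.3), whose exact
  variance is 10(0.3)(0.7) = 2.1, and the sample size are arbitrary.

    # Empirically check facts (1)-(3) with samples of X, Y ~ Bin(10, 0.3).
    import random
    from statistics import pvariance

    random.seed(0)
    N, c = 200_000, 3.0
    xs = [sum(random.random() < 0.3 for _ in range(10)) for _ in range(N)]
    ys = [sum(random.random() < 0.3 for _ in range(10)) for _ in range(N)]

    print(pvariance(xs))                               # ~ 2.1 = Var(X)
    print(pvariance([c * x for x in xs]))              # ~ c^2 * 2.1     (1)
    print(pvariance([x + c for x in xs]))              # ~ 2.1           (2)
    print(pvariance([x + y for x, y in zip(xs, ys)]))  # ~ 4.2           (3)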

  We defined independence for random variables as follows. Let X and Y
  be random variables, A be the set of values that X can take on, B be
  the set of values Y can take on. Then X and Y are independent if
    ∀a∈A ∀b∈B . Pr[X=a ∩ Y=b] = Pr[X=a] Pr[Y=b].
  Equivalently, X and Y are independent if
    ∀a∈A ∀b∈B . Pr[X=a|Y=b] = Pr[X=a], provided Pr[Y=b] > 0.

  We can similarly define mutual independence for more than two random
  variables.

Chebyshev's inequality
  So far, we have been working with the squares of deviations from the
  mean. However, we are more interested in the absolute value of such
  deviations. Now we will see how the two can be related.

  Let X be an arbitrary random variable with μ = E(X), and let
  α > 0 be any deviation of interest. Then
    Pr[|X - μ| >= α] <= Var(X)/α^2.
  This is called "Chebyshev's inequality." It tells us that the
  probability that X deviates from its mean by at least α is at
  most Var(X)/α^2.

  Another expression of Chebyshev's inequality, with α =
  βσ, where σ = √{Var(X)} is the standard
  deviation of X, is
    Pr[|X - μ| >= βσ] <= 1/β^2.
  This tells us, for example, that the probability that X deviates
  from its mean by at least two standard deviations is no more than
  1/4.

  The proof of Chebyshev's inequality relies on "Markov's inequality:"
    Pr[X >= α] <= E(X)/α
  for any α > 0, where X is a random variable that only takes on
  non-negative values.

  Generally speaking, Chebyshev's inequality gives us a tighter bound
  than Markov's, since it uses more information about X, namely its
  variance. Of course, if X takes on negative values, then we can't
  use Markov's inequality anyway.

  The proofs of both inequalities are in the reader.

  EX: In the random walk, we had E(Y) = 0, Var(Y) = n. Then if we
      take n = 1,000,000 steps, the probability we end up at least
      10,000 steps away is at most
        Pr[|Y - 0| >= 10^4] <= 10^6/(10^4)^2
          = 1/100.
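
      A scaled-down simulation sketch (Python, not from the notes): the
      walk above is too long to simulate quickly, so here we take
      n = 10,000 steps and ask for a deviation of at least 200 = 2√n,
      where Chebyshev gives Pr[|Y| >= 200] <= 10^4/200^2 = 1/4.

        import random

        random.seed(0)
        n, trials, dev = 10_000, 2_000, 200   # trial count is arbitrary
        hits = 0
        for _ in range(trials):
            # Y = position after n steps of +/-1; E(Y) = 0, Var(Y) = n.
            y = sum(random.choice((-1, 1)) for _ in range(n))
            if abs(y) >= dev:
                hits += 1
        print(hits / trials)   # typically ~0.05, well under the 1/4 bound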

  EX: If I have a probability p = 1/100 of passing my driving exam,
      then if T is the number of attempts it takes, T ~ Geom(1/100),
      E(T) = 100, Var(T) = 99/100 / (1/100)^2 = 9900. Then the
      probability it takes me at least 900 attempts is at most
        Pr[T >= 900] = Pr[T - 100 >= 800]
          <= Pr[|T - 100| >= 800]
          <= 9900/800^2
          ≈ 0.015.
      In the second line, we used the fact that
        Pr[|T-100| >= 800] = Pr[T-100 >= 800] + Pr[T-100 <= -800],
      so
        Pr[|T-100| >= 800] >= Pr[T-100 >= 800].
      This is a useful trick. (In this case, the two are equal, since
      T only takes on positive values.)

      Since T only takes on non-negative values, we can use Markov's
      bound. Let's see what we get.
        Pr[T >= 900] <= 100/900
          = 1/9 ≈ 0.11.
      This is a much weaker bound.
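
      We can also compare both bounds with the exact tail probability,
      using Pr[T >= k] = (1-p)^(k-1) for T ~ Geom(p). A small Python
      sketch (not part of the notes):

        # Exact geometric tail vs. the Chebyshev and Markov bounds for
        # T ~ Geom(1/100) and the event T >= 900.
        p = 1 / 100
        exact = (1 - p) ** 899                 # Pr[T >= 900] ~ 0.00012
        chebyshev = ((1 - p) / p**2) / 800**2  # Var(T)/800^2 ~ 0.0155
        markov = (1 / p) / 900                 # E(T)/900 = 1/9 ~ 0.11
        print(exact, chebyshev, markov)        # both bounds are loose here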

Polling
  Suppose we want to estimate the proportion p of Democrats in
  California. How many residents do we need to ask in order to be 95%
  sure that our estimate of p is within 0.05 of the actual value?

  This is an example of polling, where we want to estimate some
  fraction p to within a given "error", specified by ε, at a given
  "confidence level" 1-δ, where δ is the allowed uncertainty in
  our estimate. Here, ε = 0.05, and δ = 1-0.95 = 0.05.
  (In real-life polls, you will see these two values given along with
  the result of the poll.)

  Let's formalize this problem. Suppose we ask n random Californians
  whether or not they are Democrats. Let S_n be the number of people
  who say yes. Then we estimate the fraction p as A_n = S_n/n, i.e.
  the fraction of people we sampled who are Democrats.

  As usual, we define indicator random variables
    X_i = {1 if the ith person is a Democrat
          {0 otherwise.
  Then S_n = X_1 + X_2 + ... + X_n. Since each person is randomly
  chosen, Pr[X_i = 1] = p, the proportion of Democrats in the entire
  population. (Of course, we don't know p. But it has some value, so
  our analysis will hold, as long as we don't make any assumptions
  about its value.)

  As always for an indicator random variable, E(X_i) = Pr[X_i = 1], so
  E(X_i) = p. Then by linearity of expectation, E(S_n) = np.

  What about Var(S_n)? For each X_i, we have X_i^2 = X_i (an
  indicator is 0 or 1), so E(X_i^2) = p and
    Var(X_i) = E(X_i^2) - E(X_i)^2
      = p - p^2
      = p(1-p).
  Then we notice that the X_i are mutually independent, since each
  person polled is chosen independently and uniformly at random.
  Therefore, using fact (3) above, we get
    Var(S_n) = Var(X_1) + ... + Var(X_n)
      = n Var(X_i)
      = np(1-p).

  Of course, we could have noticed that S_n ~ Bin(n, p) and arrived at
  E(S_n) and Var(S_n) immediately. But the process above works in the
  general case, in which X_i is not an indicator random variable, as
  long as they are "independent and identically distributed,"
  abbreviated as "i.i.d.". For example, we may want to estimate the
  average wealth of Californians, in which case X_i (the wealth of the
  ith person sampled) has a more complicated distribution. We would
  still get E(X_i) = μ, the quantity we want to estimate, but we
  would now need to place an upper bound on Var(X_i) rather than using
  its exact value.

  Finally, let's consider our estimate A_n = S_n/n. By linearity of
  expectation, E(A_n) = 1/n E(S_n) = p. Similarly, by fact (1) above,
  we get Var(A_n) = 1/n^2 Var(S_n) = p(1-p)/n.

  This is good news: the estimate A_n has expectation p, and the
  variance of A_n goes down as 1/n as we increase the sample size n.
  But how big does n have to be to achieve the required error and
  confidence?

  We want no more than a δ probability that the estimate A_n will
  deviate from p by more than ε, or
    Pr[|A_n - p| >= ε] <= δ.
  We can use Chebyshev's inequality to bound the left hand side.
  Plugging in ε = 0.05, we get
    Pr[|A_n - p| >= 0.05] <= Var(A_n)/0.05^2
      = p(1-p)/(0.0025n).
  Unfortunately, we don't know p, since that is what we are trying to
  estimate. But we can place an upper bound on p(1-p): it is maximized
  at p = 1/2, giving
    p(1-p) <= 1/4,
  so
    Pr[|A_n - p| >= 0.05] <= p(1-p)/(0.0025n)
      <= 0.25/(0.0025n)
      = 100/n.
  Now, we wanted
    Pr[|A_n - p| >= 0.05] <= δ = 0.05,
  so it suffices to have
    100/n <= 0.05,
  i.e.
    n >= 2000.
  Thus, if we poll at least 2000 random Californians, we can be 95%
  sure that our estimate of p is within 0.05 of the actual value.
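
  As a rough check of this conclusion, here is a minimal simulation
  sketch (Python, not part of the notes); the true fraction p = 0.46
  and the number of simulated polls are arbitrary choices.

    import random

    random.seed(0)
    p_true, n, polls = 0.46, 2000, 1000
    good = 0
    for _ in range(polls):
        s_n = sum(random.random() < p_true for _ in range(n))  # S_n
        if abs(s_n / n - p_true) < 0.05:                       # A_n within 0.05?
            good += 1
    print(good / polls)   # typically 1.0; Chebyshev is a conservative bound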

  We can repeat the above procedure in the general case, and we will
  find that it is enough for the sample size n to satisfy
    n >= σ^2/(ε^2 δ),
  where σ^2 = Var(X_i). In practice, we would use an upper bound
  on σ^2, since we don't know what its exact value is.
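
  A tiny helper sketch for this rule (Python, not from the notes; the
  function name is made up):

    from math import ceil

    def required_sample_size(var_bound, eps, delta):
        """Smallest n with var_bound/(n*eps^2) <= delta (Chebyshev)."""
        return ceil(var_bound / (eps**2 * delta))

    # The polling example: Var(X_i) = p(1-p) <= 1/4, eps = delta = 0.05.
    print(required_sample_size(0.25, 0.05, 0.05))   # 2000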

The Law of Large Numbers
  As we noted above, Var(A_n) decreases as the sample size increases.
  This implies that A_n converges to its expected value as n
  increases. A_n is the average of the random variables X_i. In
  general, if we make many observations of i.i.d. random variables,
  then their average converges to the expected value. We formalize
  this as the "law of large numbers."

  Let X_1, ..., X_n be i.i.d. random variables with expectation μ =
  E(X_i). Let A_n = 1/n (X_1 + ... + X_n). Then for any α > 0,
  we have
    Pr[|A_n - μ| >= α] -> 0 as n -> ∞.
  Proof:
    Let σ^2 = Var(X_i), which we assume is finite. Then E(A_n) = μ
    and Var(A_n) = σ^2/n. By Chebyshev's inequality, we have
      Pr[|A_n - μ| >= α] <= Var(A_n)/α^2
        = σ^2/(nα^2)
    which goes to 0 as n -> ∞.

  Thus, as the number of samples increases, the law of large numbers
  tells us that the deviation of their average A_n from the mean μ
  tends to zero.
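
  To see this numerically, here is a minimal sketch (Python, not part
  of the notes) that estimates Pr[|A_n - μ| >= 0.1] for averages of
  fair die rolls (μ = 3.5); the values of n and the trial count are
  arbitrary.

    import random

    random.seed(0)
    mu, alpha, trials = 3.5, 0.1, 1000
    for n in (10, 100, 1000, 10000):
        deviations = 0
        for _ in range(trials):
            a_n = sum(random.randint(1, 6) for _ in range(n)) / n   # A_n
            if abs(a_n - mu) >= alpha:
                deviations += 1
        # Estimated Pr[|A_n - mu| >= 0.1] shrinks toward 0 as n grows.
        print(n, deviations / trials)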

  Note that the law of large numbers does not tell us that the
  deviation of their sum S_n = X_1 + ... + X_n from nμ tends to
  zero! For example, if we recall our coin flipping game, then X_i =
  +/- 1, and S_n is the amount we win after playing n rounds. While
  the average amount A_n we win in each round tends to zero as
  we play more rounds, the total amount S_n does not. In fact,
  Var(S_n) = nσ^2 grows with n, so the Chebyshev bound on any fixed
  deviation of S_n becomes vacuous as n becomes large.
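
  A small sketch of this contrast (Python, not part of the notes),
  printing one sample path of the +/-1 game at a few round counts:

    import random

    random.seed(0)
    s_n = 0
    checkpoints = {100, 10_000, 1_000_000}   # arbitrary rounds to print
    for n in range(1, 1_000_001):
        s_n += random.choice((-1, 1))   # win or lose 1 in round n
        if n in checkpoints:
            # A_n = S_n/n shrinks toward 0, but S_n itself keeps
            # wandering on the order of sqrt(n).
            print(n, s_n, s_n / n)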

  The assumption that S_n converges to its expectation is known as the
  "gambler's fallacy." It manifests itself in gamblers thinking that
  if they see a run of heads when flipping a fair coin, for example,
  correspondingly more tails will show up to even things out.

  You will see the gambler's fallacy in many other areas as well. For
  example, if a baseball player is having a cold streak at the plate,
  it is often said that he is "due" for a hit. This is false, of
  course, if we take at-bats to be independent. In fact, if we take
  psychology into account, then it is probably more likely that the
  cold streak will continue.

  The lesson here is that the law of large numbers applies to the
  average of many observations, but not to the sum.