Administrative info

HW8 due Wednesday. Final exam Thursday 5-8pm in 10 Evans. No regrades for HW8 (not enough time) or the final exam (UCB policy). Review session tomorrow 3-5pm in 306 Soda.

Review

We can describe a continuous random variable X in two ways:
(1) The cumulative distribution function (cdf): F(x) = Pr[X <= x].
(2) The probability density function (pdf): f(x) = d/dx F(x).
The cdf is defined for all random variables, discrete or continuous. In HW8 Q12, if you choose to do it, you demonstrate that it contains all information about a random variable.

The exponential distribution X ~ Exp(λ) has pdf
    f(x) = λ e^{-λx} for x >= 0, and f(x) = 0 for x < 0,
and cdf
    F(x) = 1 - e^{-λx} for x >= 0, and F(x) = 0 for x < 0.
It tells us how long until the first success when the rate of success per unit time is λ. The expectation and variance are E(X) = 1/λ and Var(X) = 1/λ^2.

The normal or Gaussian distribution Y ~ N(μ, σ^2) has pdf
    f(y) = 1/√{2πσ^2} e^{-(y-μ)^2/(2σ^2)}
and expectation and variance E(Y) = μ and Var(Y) = σ^2. The pdf of a normal distribution is a symmetric bell-shaped curve centered at μ, with a width determined by σ. The cdf of a normal distribution does not have a simple closed form.

The standard normal distribution has parameters μ = 0, σ = 1. So if Z is a standard normal, then Z ~ N(0, 1), and the pdf of Z is
    g(z) = 1/√{2π} e^{-z^2/2}.

Normal Distribution (cont.)

We can turn any normal distribution into a standard normal by translating and scaling. If X ~ N(μ, σ^2), then let Z = (X-μ)/σ. Then by linearity of expectation,
    E(Z) = 1/σ (E(X) - μ) = 1/σ (μ - μ) = 0.
Similarly, using our variance facts, we have
    Var(Z) = Var((X-μ)/σ) = 1/σ^2 Var(X - μ) = 1/σ^2 Var(X) = 1/σ^2 · σ^2 = 1.
So we have shown that Z has the right expectation and variance. We still need to show that Z is normal. Since Z = (X-μ)/σ, we have X = σZ + μ, so
    Pr[a <= Z <= b] = Pr[σa+μ <= X <= σb+μ] = 1/√{2πσ^2} ∫_{σa+μ}^{σb+μ} e^{-(x-μ)^2/(2σ^2)} dx.
We can do a change of variable from x to z, where z = (x-μ)/σ, or x = zσ+μ.
So the bounds of the integral become ((σa+μ)-μ)/σ = a and ((σb+μ)-μ)/σ = b, the (x-μ)^2/σ^2 in the exponent becomes z^2, and the dx becomes σ dz, giving us
    Pr[a <= Z <= b] = 1/√{2π} ∫_a^b e^{-z^2/2} dz.
Thus, Z ~ N(0, 1).

So we can turn any normal into a standard normal, and if we have a table of probabilities for the standard normal, we can determine probabilities for any normal. Often, probabilities for a standard normal are given in a "z-score" table, which tabulates Pr[Z <= z] for various values of z, where Z ~ N(0, 1).

EX: Suppose a set of exam scores follows a normal distribution with a mean of 70 and a standard deviation of 10. What is the probability that a random student scores at least 90?

Let X be the student's score. We have X ~ N(70, 100), and we want Pr[X >= 90]. Let Z ~ N(0, 1). We get
    Pr[X >= 90] = Pr[(X-70)/10 >= 2] = Pr[Z >= 2] = Pr[Z <= -2] (since a normal is symmetric around its mean) ≈ 0.02.

Some features of the normal distribution X ~ N(μ, σ^2):
(1) The value of X falls within σ of the mean with probability 0.68.
(2) The value of X falls within 2σ of the mean with probability 0.95.
(3) The value of X falls within 3σ of the mean with probability 0.997.

Useful normal tricks (for Z ~ N(0, 1)):
(1) Pr[Z >= z] = Pr[Z <= -z].
(2) Pr[Z >= z] = 1 - Pr[Z <= z].

The sum of two independent normally distributed random variables X_1 ~ N(μ_1, σ_1^2) and X_2 ~ N(μ_2, σ_2^2), Y = X_1 + X_2, is also normally distributed: Y ~ N(μ_1+μ_2, σ_1^2+σ_2^2). (Of course, you already knew its expectation and variance; the important fact is that the sum is normal.) The normal distribution models aggregate results from many independent observations of the same random variable, as we will see next.

The Central Limit Theorem

Recall the law of large numbers. Given i.i.d. random variables X_i with common mean μ and variance σ^2, we defined the sample average as A_n = 1/n ∑_{i=1}^n X_i. Then A_n has mean μ and variance σ^2/n.
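An aside on the z-score table mentioned above: in code, no table is needed, because the standard normal cdf has a closed form in terms of the error function, Pr[Z <= z] = (1 + erf(z/√2))/2. The following is a minimal Python sketch (the helper names phi and normal_tail are our own) that reproduces the exam-score example and the ±σ probabilities:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf, Pr[Z <= z], via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_tail(x, mu, sigma):
    """Pr[X >= x] for X ~ N(mu, sigma^2), by standardizing."""
    return 1.0 - phi((x - mu) / sigma)

# Exam example: X ~ N(70, 100), Pr[X >= 90] -- about 0.023, i.e. ~0.02.
print(round(normal_tail(90, 70, 10), 4))

# The "within k sigma of the mean" probabilities: Pr[-k <= Z <= k].
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 3))
```

Note that a z-score table and phi(z) carry the same information; the table is just a precomputed version of this function.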
This implies, by Chebyshev's inequality, that the probability of any deviation α from the mean goes to 0 as n->∞:
    Pr[|A_n - μ| >= α] <= Var(A_n)/α^2 = σ^2/(nα^2) -> 0 as n->∞.

We can actually say something much stronger than the law of large numbers: the distribution of A_n tends to the normal distribution with mean μ and variance σ^2/n as n becomes large. To state this precisely, so that we get convergence to a single distribution, we first scale A_n so that its mean is 0 and variance is 1:
    Z_n = (A_n - μ) √n / σ = n (A_n - μ) / (σ √n) = n (1/n ∑_{i=1}^n X_i - μ) / (σ √n) = (∑_{i=1}^n X_i - nμ) / (σ√n).
Then the distribution of Z_n tends to that of the standard normal Z as n->∞, meaning
    ∀α∈R, Pr[Z_n <= α] -> Pr[Z <= α] as n->∞.

Since the sample mean A_n is just a scaling and translation of Z_n, it too has an approximately normal distribution for large n, but with mean μ and variance σ^2/n. Finally, the sample sum S_n = ∑_{i=1}^n X_i also has an approximately normal distribution, with parameters nμ and nσ^2, since it is just a scaling of the sample mean. (Note that, as we saw in discussing the LLN, the probability of a deviation of S_n from its mean does not tend to 0. Its distribution does still tend to a normal distribution, but one with increasing variance as n->∞.)

The central limit theorem tells us that if we take n observations of i.i.d. random variables X_i, no matter what distribution the X_i have (as long as the mean and variance are finite, and the variance is nonzero), the distribution of the sample mean or sum tends to that of the normal distribution. The sample mean tends to a normal distribution with parameters μ and σ^2/n, where μ = E(X_i) and σ^2 = Var(X_i), and the sample sum tends to a normal distribution with parameters nμ and nσ^2. This explains the prevalence of the normal distribution, and it allows us to approximate distributions that are the sum of i.i.d. random variables.

The simplest example of the CLT in action is the binomial distribution.
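The convergence statement above can also be checked by simulation: draw the sum of n i.i.d. variables many times, standardize it, and compare the empirical Pr[Z_n <= α] to the standard normal cdf. Here is a Python sketch (the function name clt_check and the choice of the uniform distribution on {0, 1, 2}, which has μ = 1 and σ^2 = 2/3, are ours):

```python
import random
from math import erf, sqrt

def clt_check(n, alpha=1.0, trials=50_000, seed=0):
    """Empirical estimate of Pr[Z_n <= alpha], where Z_n is the
    standardized sum of n i.i.d. Uniform{0,1,2} random variables."""
    rng = random.Random(seed)
    mu, var = 1.0, 2.0 / 3.0
    hits = 0
    for _ in range(trials):
        s = sum(rng.randrange(3) for _ in range(n))  # sum of n draws
        z = (s - n * mu) / sqrt(n * var)             # standardize
        if z <= alpha:
            hits += 1
    return hits / trials

phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))  # standard normal cdf
print(phi(1.0))        # target value, about 0.841
print(clt_check(1))    # n = 1: still far from normal (exactly 2/3)
print(clt_check(30))   # n = 30: already close to the target
```

The n = 1 estimate is off because a single Uniform{0,1,2} draw is nothing like a normal; by n = 30 the standardized sum is already a good match.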
A binomial random variable X ~ Bin(n, p) is the sum of n i.i.d. indicator random variables, X = X_1 + ... + X_n, where X_i = 1 with probability p and X_i = 0 with probability 1-p. This explains why the binomial distribution is bell-shaped. It also allows us to approximate the binomial distribution using a normal distribution with parameters np and np(1-p). A standard rule of thumb is that the normal approximation is reasonable if np >= 5 and n(1-p) >= 5.

EX: Suppose you flip a biased coin with probability p = 0.2 of heads 100 times. What is the probability that you get more than 30 heads?

Let X be the number of heads. Then X ~ Bin(100, 0.2), and np = 20 >= 5 and n(1-p) = 80 >= 5. Thus, we can approximate X with a normally distributed random variable Y ~ N(20, 16). Then we want
    Pr[X > 30] ≈ Pr[Y > 30] = Pr[(Y-20)/4 > 2.5] = Pr[Z > 2.5] (where Z ~ N(0, 1)) = 1 - Pr[Z < 2.5] ≈ 0.006.
Since the binomial distribution is discrete while the normal distribution is continuous, we can get a better approximation by applying a "continuity correction." However, we do not require you to use a continuity correction in this class.

Illustration of CLT

Let's do another simple example that illustrates the central limit theorem. Consider the case where the X_i are i.i.d. and have the uniform distribution on {0, 1, 2}:

    1/3 | *  *  *
        `---------
          0  1  2

Let Z_n be the sum of X_1, ..., X_n. For Z_2, Pr[Z_2 = k] is just 1/9 times the number of ways that X_1 + X_2 = k. For k in {0, 1, 2}, this is just pirate coins/stars and bars, so it is C(2+k-1, 2-1) = k+1. The distribution is symmetric around the mean, so we get

    Z_2 = 0 w.p. 1/9, 1 w.p. 2/9, 2 w.p. 3/9, 3 w.p. 2/9, 4 w.p. 1/9.

    3/9 |       *
    2/9 |    *  *  *
    1/9 | *  *  *  *  *
        `---------------
          0  1  2  3  4

For Z_3, it is a little more complicated, but we get

    Z_3 = 0 w.p. 1/27, 1 w.p. 3/27, 2 w.p. 6/27, 3 w.p. 7/27, 4 w.p. 6/27, 5 w.p. 3/27, 6 w.p. 1/27.

    7/27 |          *
    6/27 |       *  *  *
    5/27 |       *  *  *
    4/27 |       *  *  *
    3/27 |    *  *  *  *  *
    2/27 |    *  *  *  *  *
    1/27 | *  *  *  *  *  *  *
         `---------------------
           0  1  2  3  4  5  6

We can already see the beginnings of a bell-shaped curve, with the sum of just three i.i.d. random variables.

Proof of CLT (Optional)

The following is an overview of the proof of the central limit theorem. It is optional, was not covered in lecture, and will not be on the exam, so feel free to skip this section if you are not interested.

We start by defining the "characteristic function" of a random variable X as the function φ_X(t) = E(e^{itX}), i.e. the value of φ_X(t) is the expectation of e^{itX}. Recall that a random variable is a function from outcomes to another set, so e^{itX} is another random variable, defined as (e^{itX})(ω) = e^{itX(ω)}. This random variable is a function from outcomes to the complex numbers, so it has an expectation.

Like the cdf, the characteristic function encodes all the information about a random variable. Also like the cdf, it always exists, even when the pdf does not or when the mean and variance do not. If the pdf does exist, then the characteristic function is its (unscaled) Fourier transform:
    φ_X(t) = E(e^{itX}) = ∫_{-∞}^{+∞} e^{itx} f(x) dx,
where f(x) is the pdf of X. In particular, we can compute the characteristic function of a normal random variable Y ~ N(μ, σ^2):
    φ_Y(t) = e^{itμ - σ^2 t^2/2}.
The characteristic function of a standard normal Z ~ N(0, 1) is then
    φ_Z(t) = e^{-t^2/2}.

The characteristic function of the sum of two independent random variables X and Y is the product of their characteristic functions:
    φ_{X+Y}(t) = E(e^{it(X+Y)}) = E(e^{itX} e^{itY}) = E(e^{itX}) E(e^{itY}) (since X, Y independent) = φ_X(t) φ_Y(t).
In the third step, we used the fact that since X and Y are independent, e^{itX} and e^{itY} are independent.

The characteristic function of a scaled random variable cX, where c is a constant, is
    φ_{cX}(t) = E(e^{it(cX)}) = E(e^{i(ct)X}) = φ_X(ct),
by a simple change of variable.
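These formulas can be sanity-checked numerically: since φ_X(t) = E(e^{itX}), we can estimate it by averaging e^{itX} over many samples of X. The sketch below (the function name char_fn_estimate is ours) estimates the characteristic function of a standard normal and compares it with e^{-t^2/2}:

```python
import cmath
import random
from math import exp

def char_fn_estimate(t, samples=100_000, seed=1):
    """Monte Carlo estimate of phi_Z(t) = E(e^{itZ}) for Z ~ N(0, 1)."""
    rng = random.Random(seed)
    total = sum(cmath.exp(1j * t * rng.gauss(0, 1)) for _ in range(samples))
    return total / samples

# Compare the estimate (a complex number) to the formula e^{-t^2/2}.
for t in (0.5, 1.0, 2.0):
    est = char_fn_estimate(t)
    print(t, round(est.real, 3), round(exp(-t * t / 2), 3))
```

The imaginary part of each estimate is near 0, as it should be: φ_Z(t) is real because the standard normal is symmetric around 0.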
If the mean and variance of a random variable X exist and are finite, we can use the Taylor expansion of e^x to approximate the characteristic function of X/√n:
    φ_{X/√n}(t) = E(e^{itX/√n}) ≈ E(1 + itX/√n - t^2 X^2/(2n)) = 1 + (it/√n) E(X) - (t^2/(2n)) E(X^2).
As n->∞, the omitted higher-order terms go to 0 faster than the terms we kept, so this becomes a good approximation. If the mean is 0 and the variance is 1, then E(X) = 0 and E(X^2) = Var(X) + E(X)^2 = 1, so we get
    φ_{X/√n}(t) ≈ 1 - t^2/(2n).

Finally, Levy's continuity theorem tells us that if the characteristic functions of a sequence of random variables Z_1, Z_2, ... converge to the characteristic function of another random variable Z, then the cdfs of Z_1, Z_2, ... also converge to the cdf of Z. This means that they "converge in distribution" to the distribution of Z.

We are now ready to prove the CLT. We will restrict ourselves to the case that the individual random variables are i.i.d. Consider i.i.d. random variables X_1, ..., X_n with common mean μ and variance σ^2, both finite and the variance nonzero. Let Y_i = (X_i - μ)/σ for i = 1, ..., n. Then the Y_i have common mean 0 and variance 1. Let
    Z_n = (∑_{i=1}^n X_i - nμ) / (σ√n).
Then we see that Z_n = ∑_{i=1}^n Y_i/√n. Since the Y_i have a mean of 0 and a variance of 1, the characteristic function of Y_i/√n is
    φ_{Y_i/√n}(t) ≈ 1 - t^2/(2n).
Then the characteristic function of Z_n is
    φ_{Z_n}(t) = φ_{∑ Y_i/√n}(t) = φ_{Y_1/√n}(t) ⋯ φ_{Y_n/√n}(t) ≈ (1 - t^2/(2n))^n ≈ e^{(-t^2/(2n)) n} (as n->∞) = e^{-t^2/2}.
Thus, the characteristic functions of the Z_n converge to that of the standard normal as n->∞, so by Levy's continuity theorem, the distributions of the Z_n converge to that of the standard normal.
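The final approximation step, (1 - t^2/(2n))^n -> e^{-t^2/2}, is just the standard limit (1 + x/n)^n -> e^x with x = -t^2/2, and it is easy to check numerically. A short Python sketch (the function name zn_char_approx is ours):

```python
from math import exp

def zn_char_approx(t, n):
    """The product of n copies of the approximate characteristic
    function 1 - t^2/(2n) of Y_i/sqrt(n), as in the proof."""
    return (1 - t * t / (2 * n)) ** n

t = 1.5
target = exp(-t * t / 2)  # phi_Z(t) for the standard normal
for n in (1, 10, 100, 10_000):
    print(n, zn_char_approx(t, n), target)
```

For small n the approximation is poor (it can even be negative), but it converges to the standard normal characteristic function as n grows, which is all the proof needs.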