Administrative info

HW8 due Wednesday. Final exam Thursday 5-8pm in 10 Evans. No regrades for HW8 (not enough time) or the final exam (UCB policy). Review session tomorrow 3-5pm in 306 Soda.

Review

We can describe a continuous random variable X in two ways:
(1) The cumulative distribution function (cdf): F(x) = Pr[X <= x].
(2) The probability density function (pdf): f(x) = d/dx F(x).
The cdf is defined for all random variables, discrete or continuous. In HW8 Q12, if you choose to do it, you demonstrate that it contains all information about a random variable.

The exponential distribution X ~ Exp(λ) has pdf
    f(x) = λ e^{-λx} for x >= 0, and f(x) = 0 for x < 0,
and cdf
    F(x) = 1 - e^{-λx} for x >= 0, and F(x) = 0 for x < 0.
It tells us how long until the first success when the rate of success per unit time is λ. The expectation and variance are E(X) = 1/λ and Var(X) = 1/λ^2.

The normal or Gaussian distribution Y ~ N(μ, σ^2) has pdf
    f(y) = 1/√{2πσ^2} e^{-(y-μ)^2/(2σ^2)}
and expectation and variance E(Y) = μ and Var(Y) = σ^2. The pdf of a normal distribution is a symmetric bell-shaped curve centered at μ, with a width determined by σ. The cdf of a normal distribution does not have a simple closed form.

The standard normal distribution has parameters μ = 0, σ = 1. So if Z is a standard normal, then Z ~ N(0, 1), and the pdf of Z is
    g(z) = 1/√{2π} e^{-z^2/2}.

Normal Distribution (cont.)

We can turn any normal distribution into a standard normal by translating and scaling. If X ~ N(μ, σ^2), then let Z = (X-μ)/σ. Then by linearity of expectation,
    E(Z) = 1/σ (E(X) - μ) = 1/σ (μ - μ) = 0.
Similarly, using our variance facts, we have
    Var(Z) = Var((X-μ)/σ) = 1/σ^2 Var(X - μ) = 1/σ^2 Var(X) = 1/σ^2 · σ^2 = 1.
So we have shown that Z has the right expectation and variance. We still need to show that Z is normal. Since Z = (X-μ)/σ, we have X = σZ + μ, so
    Pr[a <= Z <= b] = Pr[σa+μ <= X <= σb+μ] = 1/√{2πσ^2} ∫_{σa+μ}^{σb+μ} e^{-(x-μ)^2/(2σ^2)} dx.
We can do a change of variable from x to z, where z = (x-μ)/σ, or x = zσ+μ.
So the bounds of the integral become ((σa+μ)-μ)/σ = a and ((σb+μ)-μ)/σ = b, the (x-μ)^2/σ^2 in the exponent becomes z^2, and the dx becomes σ dz, giving us
    Pr[a <= Z <= b] = 1/√{2π} ∫_a^b e^{-z^2/2} dz.
Thus, Z ~ N(0, 1).

So we can turn any normal into a standard normal, and if we have a table of probabilities for the standard normal, we can determine probabilities for any normal. Often, probabilities for a standard normal are given in a "z-score" table, which tabulates Pr[Z <= z] for various values of z, where Z ~ N(0, 1).

EX: Suppose a set of exam scores follows a normal distribution with a mean of 70 and a standard deviation of 10. What is the probability that a random student scores at least 90?

Let X be the student's score. We have X ~ N(70, 100), and we want Pr[X >= 90]. Let Z ~ N(0, 1). We get
    Pr[X >= 90] = Pr[(X-70)/10 >= 2] = Pr[Z >= 2] = Pr[Z <= -2] (since a normal is symmetric around its mean) ≈ 0.02.

Some features of the normal distribution X ~ N(μ, σ^2):
(1) The value of X falls within σ of the mean with probability 0.68.
(2) The value of X falls within 2σ of the mean with probability 0.95.
(3) The value of X falls within 3σ of the mean with probability 0.997.

Useful normal tricks (for Z ~ N(0, 1)):
(1) Pr[Z >= z] = Pr[Z <= -z].
(2) Pr[Z >= z] = 1 - Pr[Z <= z].

The sum of two independent normally distributed random variables X_1 ~ N(μ_1, σ_1^2) and X_2 ~ N(μ_2, σ_2^2), Y = X_1 + X_2, is also normally distributed: Y ~ N(μ_1+μ_2, σ_1^2+σ_2^2). (Of course, you already knew its expectation and variance; the important fact is that the sum is normal.) The normal distribution models aggregate results from many independent observations of the same random variable, as we will see next.

The Central Limit Theorem

Recall the law of large numbers. Given i.i.d. random variables X_i with common mean μ and variance σ^2, we defined the sample average as A_n = 1/n ∑_{i=1}^n X_i. Then A_n has mean μ and variance σ^2/n.
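An aside on the z-score table mentioned above: in code, no table is needed, because the standard normal cdf has a closed form in terms of the error function, Pr[Z <= z] = (1 + erf(z/√2))/2. The following is a minimal Python sketch (the helper names phi and normal_tail are our own) that reproduces the exam-score example and the ±σ probabilities:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf, Pr[Z <= z], via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_tail(x, mu, sigma):
    """Pr[X >= x] for X ~ N(mu, sigma^2), by standardizing."""
    return 1.0 - phi((x - mu) / sigma)

# Exam example: X ~ N(70, 100), Pr[X >= 90] -- about 0.023, i.e. ~0.02.
print(round(normal_tail(90, 70, 10), 4))

# The "within k sigma of the mean" probabilities: Pr[-k <= Z <= k].
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 3))
```

Note that a z-score table and phi(z) carry the same information; the table is just a precomputed version of this function.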
This implies, by Chebyshev's inequality, that the probability of any deviation α from the mean goes to 0 as n->∞:
    Pr[|A_n - μ| >= α] <= Var(A_n)/α^2 = σ^2/(nα^2) -> 0 as n->∞.

We can actually say something much stronger than the law of large numbers: the distribution of A_n tends to the normal distribution with mean μ and variance σ^2/n as n becomes large. To state this precisely, so that we get convergence to a single distribution, we first scale A_n so that its mean is 0 and variance is 1:
    Z_n = (A_n - μ) √n / σ = n (A_n - μ) / (σ √n) = n (1/n ∑_{i=1}^n X_i - μ) / (σ √n) = (∑_{i=1}^n X_i - nμ) / (σ√n).
Then the distribution of Z_n tends to that of the standard normal Z as n->∞, meaning
    ∀α∈R, Pr[Z_n <= α] -> Pr[Z <= α] as n->∞.

Since the sample mean A_n is just a scaling and translation of Z_n, it too has an approximately normal distribution for large n, but with mean μ and variance σ^2/n. Finally, the sample sum S_n = ∑_{i=1}^n X_i also has an approximately normal distribution, with parameters nμ and nσ^2, since it is just a scaling of the sample mean. (Note that, as we saw in discussing the LLN, the probability of a deviation of S_n from its mean does not tend to 0. Its distribution does still tend to a normal distribution, but one with increasing variance as n->∞.)

The central limit theorem tells us that if we take n observations of i.i.d. random variables X_i, no matter what distribution the X_i have (as long as the mean and variance are finite, and the variance is nonzero), the distribution of the sample mean or sum tends to that of the normal distribution. The sample mean tends to a normal distribution with parameters μ and σ^2/n, where μ = E(X_i) and σ^2 = Var(X_i), and the sample sum tends to a normal distribution with parameters nμ and nσ^2. This explains the prevalence of the normal distribution, and it allows us to approximate distributions that are the sum of i.i.d. random variables.

The simplest example of the CLT in action is the binomial distribution.
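The convergence statement above can also be checked by simulation: draw the sum of n i.i.d. variables many times, standardize it, and compare the empirical Pr[Z_n <= α] to the standard normal cdf. Here is a Python sketch (the function name clt_check and the choice of the uniform distribution on {0, 1, 2}, which has μ = 1 and σ^2 = 2/3, are ours):

```python
import random
from math import erf, sqrt

def clt_check(n, alpha=1.0, trials=50_000, seed=0):
    """Empirical estimate of Pr[Z_n <= alpha], where Z_n is the
    standardized sum of n i.i.d. Uniform{0,1,2} random variables."""
    rng = random.Random(seed)
    mu, var = 1.0, 2.0 / 3.0
    hits = 0
    for _ in range(trials):
        s = sum(rng.randrange(3) for _ in range(n))  # sum of n draws
        z = (s - n * mu) / sqrt(n * var)             # standardize
        if z <= alpha:
            hits += 1
    return hits / trials

phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))  # standard normal cdf
print(phi(1.0))        # target value, about 0.841
print(clt_check(1))    # n = 1: still far from normal (exactly 2/3)
print(clt_check(30))   # n = 30: already close to the target
```

The n = 1 estimate is off because a single Uniform{0,1,2} draw is nothing like a normal; by n = 30 the standardized sum is already a good match.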
A binomial random variable X ~ Bin(n, p) is the sum of n i.i.d. indicator random variables, X = X_1 + ... + X_n, where X_i = 1 with probability p and X_i = 0 with probability 1-p. This explains why the binomial distribution is bell-shaped. It also allows us to approximate the binomial distribution using a normal distribution with parameters np and np(1-p). A standard rule of thumb is that the normal approximation is reasonable if np >= 5 and n(1-p) >= 5.

EX: Suppose you flip a biased coin with probability p = 0.2 of heads 100 times. What is the probability that you get more than 30 heads?

Let X be the number of heads. Then X ~ Bin(100, 0.2), and np = 20 >= 5 and n(1-p) = 80 >= 5. Thus, we can approximate X with a normally distributed random variable Y ~ N(20, 16). Then we want
    Pr[X > 30] ≈ Pr[Y > 30] = Pr[(Y-20)/4 > 2.5] = Pr[Z > 2.5] (where Z ~ N(0, 1)) = 1 - Pr[Z < 2.5] ≈ 0.006.
Since the binomial distribution is discrete while the normal distribution is continuous, we can get a better approximation by applying a "continuity correction." However, we do not require you to use a continuity correction in this class.

Illustration of CLT

Let's do another simple example that illustrates the central limit theorem. Consider the case where the X_i are i.i.d. and have the uniform distribution on {0, 1, 2}:

    1/3 | *  *  *
        `---------
          0  1  2

Let Z_n be the sum of X_1, ..., X_n. For Z_2, Pr[Z_2 = k] is just 1/9 times the number of ways that X_1 + X_2 = k. For k in {0, 1, 2}, this is just pirate coins/stars and bars, so it is C(2+k-1, 2-1) = k+1. The distribution is symmetric around the mean, so we get

    Z_2 = 0 w.p. 1/9, 1 w.p. 2/9, 2 w.p. 3/9, 3 w.p. 2/9, 4 w.p. 1/9.

    3/9 |       *
    2/9 |    *  *  *
    1/9 | *  *  *  *  *
        `---------------
          0  1  2  3  4

For Z_3, it is a little more complicated, but we get

    Z_3 = 0 w.p. 1/27, 1 w.p. 3/27, 2 w.p. 6/27, 3 w.p. 7/27, 4 w.p. 6/27, 5 w.p. 3/27, 6 w.p. 1/27.

    7/27 |          *
    6/27 |       *  *  *
    5/27 |       *  *  *
    4/27 |       *  *  *
    3/27 |    *  *  *  *  *
    2/27 |    *  *  *  *  *
    1/27 | *  *  *  *  *  *  *
         `---------------------
           0  1  2  3  4  5  6

We can already see the beginnings of a bell-shaped curve, with the sum of just three i.i.d. random variables.

Proof of CLT (Optional)

The following is an overview of the proof of the central limit theorem. It is optional, was not covered in lecture, and will not be on the exam, so feel free to skip this section if you are not interested.

We start by defining the "characteristic function" of a random variable X as the function φ_X(t) = E(e^{itX}), i.e. the value of φ_X(t) is the expectation of e^{itX}. Recall that a random variable is a function from outcomes to another set, so e^{itX} is another random variable, defined as (e^{itX})(ω) = e^{itX(ω)}. This random variable is a function from outcomes to the complex numbers, so it has an expectation.

Like the cdf, the characteristic function encodes all the information about a random variable. Also like the cdf, it always exists, even when the pdf does not or when the mean and variance do not. If the pdf does exist, then the characteristic function is its (unscaled) Fourier transform:
    φ_X(t) = E(e^{itX}) = ∫_{-∞}^{+∞} e^{itx} f(x) dx,
where f(x) is the pdf of X. In particular, we can compute the characteristic function of a normal random variable Y ~ N(μ, σ^2):
    φ_Y(t) = e^{itμ - σ^2 t^2/2}.
The characteristic function of a standard normal Z ~ N(0, 1) is then
    φ_Z(t) = e^{-t^2/2}.

The characteristic function of the sum of two independent random variables X and Y is the product of their characteristic functions:
    φ_{X+Y}(t) = E(e^{it(X+Y)}) = E(e^{itX} e^{itY}) = E(e^{itX}) E(e^{itY}) (since X, Y independent) = φ_X(t) φ_Y(t).
In the third step, we used the fact that since X and Y are independent, e^{itX} and e^{itY} are independent.

The characteristic function of a scaled random variable cX, where c is a constant, is
    φ_{cX}(t) = E(e^{it(cX)}) = E(e^{i(ct)X}) = φ_X(ct),
by a simple change of variable.
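These formulas can be sanity-checked numerically: since φ_X(t) = E(e^{itX}), we can estimate it by averaging e^{itX} over many samples of X. The sketch below (the function name char_fn_estimate is ours) estimates the characteristic function of a standard normal and compares it with e^{-t^2/2}:

```python
import cmath
import random
from math import exp

def char_fn_estimate(t, samples=100_000, seed=1):
    """Monte Carlo estimate of phi_Z(t) = E(e^{itZ}) for Z ~ N(0, 1)."""
    rng = random.Random(seed)
    total = sum(cmath.exp(1j * t * rng.gauss(0, 1)) for _ in range(samples))
    return total / samples

# Compare the estimate (a complex number) to the formula e^{-t^2/2}.
for t in (0.5, 1.0, 2.0):
    est = char_fn_estimate(t)
    print(t, round(est.real, 3), round(exp(-t * t / 2), 3))
```

The imaginary part of each estimate is near 0, as it should be: φ_Z(t) is real because the standard normal is symmetric around 0.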
If the mean and variance of a random variable X exist and are finite, we can use the Taylor expansion of e^x to approximate the characteristic function of X/√n:
    φ_{X/√n}(t) = E(e^{itX/√n}) ≈ E(1 + itX/√n - t^2 X^2/(2n)) = 1 + (it/√n) E(X) - (t^2/(2n)) E(X^2).
As n->∞, the omitted higher-order terms go to 0 faster than the terms we kept, so this becomes a good approximation. If the mean is 0 and the variance is 1, then E(X) = 0 and E(X^2) = Var(X) + E(X)^2 = 1, so we get
    φ_{X/√n}(t) ≈ 1 - t^2/(2n).

Finally, Levy's continuity theorem tells us that if the characteristic functions of a sequence of random variables Z_1, Z_2, ... converge to the characteristic function of another random variable Z, then the cdfs of Z_1, Z_2, ... also converge to the cdf of Z. This means that they "converge in distribution" to the distribution of Z.

We are now ready to prove the CLT. We will restrict ourselves to the case that the individual random variables are i.i.d. Consider i.i.d. random variables X_1, ..., X_n with common mean μ and variance σ^2, both finite and the variance nonzero. Let Y_i = (X_i - μ)/σ for i = 1, ..., n. Then the Y_i have common mean 0 and variance 1. Let
    Z_n = (∑_{i=1}^n X_i - nμ) / (σ√n).
Then we see that Z_n = ∑_{i=1}^n Y_i/√n. Since the Y_i have a mean of 0 and a variance of 1, the characteristic function of Y_i/√n is
    φ_{Y_i/√n}(t) ≈ 1 - t^2/(2n).
Then the characteristic function of Z_n is
    φ_{Z_n}(t) = φ_{∑ Y_i/√n}(t) = φ_{Y_1/√n}(t) ⋯ φ_{Y_n/√n}(t) ≈ (1 - t^2/(2n))^n ≈ e^{(-t^2/(2n)) n} (as n->∞) = e^{-t^2/2}.
Thus, the characteristic functions of the Z_n converge to that of the standard normal as n->∞, so by Levy's continuity theorem, the distributions of the Z_n converge to that of the standard normal.
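The final approximation step, (1 - t^2/(2n))^n -> e^{-t^2/2}, is just the standard limit (1 + x/n)^n -> e^x with x = -t^2/2, and it is easy to check numerically. A short Python sketch (the function name zn_char_approx is ours):

```python
from math import exp

def zn_char_approx(t, n):
    """The product of n copies of the approximate characteristic
    function 1 - t^2/(2n) of Y_i/sqrt(n), as in the proof."""
    return (1 - t * t / (2 * n)) ** n

t = 1.5
target = exp(-t * t / 2)  # phi_Z(t) for the standard normal
for n in (1, 10, 100, 10_000):
    print(n, zn_char_approx(t, n), target)
```

For small n the approximation is poor (it can even be negative), but it converges to the standard normal characteristic function as n grows, which is all the proof needs.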