Administrative info

HW7 is out, due Monday. MT2 is next Tuesday, with the same location and policies as MT1. It covers material through polling/LLN (Wednesday's lecture).

Review

Recall that the variance of a random variable Z with expectation E(Z) = μ is Var(Z) = E((Z - μ)^2). It tells us something about the deviation of Z from its mean. The "standard deviation" of Z is σ(Z) = √(Var(Z)), which in some sense undoes the square in the variance. An alternative expression for the variance is Var(Z) = E(Z^2) - μ^2.

For a binomial random variable X ~ Bin(n, p), Var(X) = np(1-p). For a geometric random variable Y ~ Geom(p), Var(Y) = (1-p)/p^2. For a Poisson random variable Z ~ Poiss(λ), Var(Z) = λ.

Here are some useful facts about variance.
(1) Var(cX) = c^2 Var(X), where c is a constant.
(2) Var(X+c) = Var(X), where c is a constant.
(3) Var(X+Y) = Var(X) + Var(Y) if X and Y are independent.

We defined independence for random variables as follows. Let X and Y be random variables, A be the set of values that X can take on, and B be the set of values that Y can take on. Then X and Y are independent if
  ∀a∈A ∀b∈B . Pr[X=a ∩ Y=b] = Pr[X=a] Pr[Y=b].
Equivalently, X and Y are independent if
  ∀a∈A ∀b∈B . Pr[X=a | Y=b] = Pr[X=a].
We can similarly define mutual independence for more than two random variables.

Chebyshev's inequality

So far, we have been working with the squares of deviations from the mean. However, we are more interested in the absolute value of such deviations. Now we will see how the two can be related. Let X be an arbitrary random variable, μ = E(X), and α > 0 be any deviation of interest. Then
  Pr[|X - μ| >= α] <= Var(X)/α^2.
This is called "Chebyshev's inequality." It tells us that the probability that X deviates from its mean by at least α is at most Var(X)/α^2.

Another expression of Chebyshev's inequality, with α = βσ, where σ = √(Var(X)) is the standard deviation of X, is
  Pr[|X - μ| >= βσ] <= 1/β^2.
This tells us, for example, that the probability of X deviating from the mean by more than two standard deviations is no more than 1/4.

The proof of Chebyshev's inequality relies on "Markov's inequality":
  Pr[X >= α] <= E(X)/α
for a random variable X that only takes on non-negative values. Generally speaking, Chebyshev's inequality gives us a tighter bound than Markov's, since it uses more information about X, namely its variance. Of course, if X takes on negative values, then we can't use Markov's inequality anyway. The proofs of both inequalities are in the reader.

EX: In the random walk, we had E(Y) = 0 and Var(Y) = n. Then if we take n = 1,000,000 steps, the probability that we end up at least 10,000 steps away is at most
  Pr[|Y - 0| >= 10^4] <= 10^6/(10^4)^2 = 1/100.

EX: If I have a probability p = 1/100 of passing my driving exam, and T is the number of attempts it takes, then T ~ Geom(1/100), E(T) = 100, and Var(T) = (99/100)/(1/100)^2 = 9900. Then the probability that it takes me at least 900 attempts is at most
  Pr[T >= 900] = Pr[T - 100 >= 800] <= Pr[|T - 100| >= 800] <= 9900/800^2 ≈ 0.015.
In the second step, we used the fact that Pr[|T-100| >= 800] = Pr[T-100 >= 800] + Pr[T-100 <= -800], so Pr[|T-100| >= 800] >= Pr[T-100 >= 800]. This is a useful trick. (In this case, the two are equal, since T only takes on positive values, so the event T-100 <= -800 is impossible.)

Since T only takes on non-negative values, we can also use Markov's bound. Let's see what we get:
  Pr[T >= 900] <= 100/900 = 1/9 ≈ 0.11.
This is a much weaker bound.
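To make the comparison concrete, here is a small Python sketch (an illustration added here, not part of the original notes; it assumes numpy is available) that computes the Markov and Chebyshev bounds for the driving-exam example and checks them against a simulation of T ~ Geom(1/100).

# Compare Markov and Chebyshev bounds on Pr[T >= 900] for T ~ Geom(1/100)
# against an empirical estimate from simulated samples.
import numpy as np

p = 1/100
mean = 1/p                     # E(T) = 100
var = (1 - p)/p**2             # Var(T) = 9900

markov_bound = mean/900                   # Pr[T >= 900] <= E(T)/900
chebyshev_bound = var/(900 - mean)**2     # Pr[|T - 100| >= 800] <= Var(T)/800^2

rng = np.random.default_rng(0)
samples = rng.geometric(p, size=1_000_000)    # number of attempts until first pass
empirical = np.mean(samples >= 900)

print(f"Markov bound:    {markov_bound:.4f}")     # ~0.1111
print(f"Chebyshev bound: {chebyshev_bound:.4f}")  # ~0.0155
print(f"Empirical:       {empirical:.4f}")        # much smaller than both bounds

Both inequalities are valid but loose here; the simulation shows the true tail probability is far smaller than either bound.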
Polling

Suppose we want to estimate the proportion p of Democrats in California. How many residents do we need to ask in order to be 95% sure that our estimate of p is within 0.05 of the actual value? This is an example of polling, where we want to estimate some fraction p to within a given "error", specified by ε, at a given "confidence level", specified by δ, the allowed uncertainty in our estimate. Here, ε = 0.05 and δ = 1 - 0.95 = 0.05. (In real-life polls, you will see these two values given along with the result of the poll.)

Let's formalize this problem. Suppose we ask n random Californians whether or not they are Democrats. Let S_n be the number of people who say yes. Then we estimate the fraction p as A_n = S_n/n, i.e. the fraction of the people we sampled who are Democrats. As usual, we define indicator random variables
  X_i = { 1 if the ith person is a Democrat,
        { 0 otherwise.
Then S_n = X_1 + X_2 + ... + X_n. Since each person is randomly chosen, Pr[X_i = 1] = p, the proportion of Democrats in the entire population. (Of course, we don't know p. But it has some value, so our analysis will hold, as long as we don't make any assumptions about its value.) As always for an indicator random variable, E(X_i) = Pr[X_i = 1], so E(X_i) = p. Then by linearity of expectation, E(S_n) = np.

What about Var(S_n)? For each X_i, we have
  Var(X_i) = E(X_i^2) - E(X_i)^2 = p - p^2 = p(1-p).
Then we notice that the X_i are mutually independent, since each person polled is chosen independently and uniformly at random. Therefore, using fact (3) above, we get
  Var(S_n) = Var(X_1) + ... + Var(X_n) = n Var(X_i) = np(1-p).
Of course, we could have noticed that S_n ~ Bin(n, p) and arrived at E(S_n) and Var(S_n) immediately. But the process above works in the general case, in which the X_i are not indicator random variables, as long as they are "independent and identically distributed," abbreviated as "i.i.d.". For example, we may want to estimate the average wealth of Californians, in which case X_i has a more complicated distribution. We would still get E(A_n) = E(X_i), the quantity we are trying to estimate, but we would now need to place a bound on Var(X_i) rather than using the exact value.

Finally, let's consider our estimate A_n = S_n/n. By linearity of expectation, E(A_n) = (1/n) E(S_n) = p. Similarly, by fact (1) above, we get Var(A_n) = (1/n^2) Var(S_n) = p(1-p)/n. This is good news, since we expect the estimate A_n to be p, and the variance of A_n goes down linearly as we increase the sample size n.

But how big does n have to be to achieve the required error and confidence? We want no more than a δ probability that the estimate A_n deviates from p by more than ε, i.e.
  Pr[|A_n - p| >= ε] <= δ.
We can use Chebyshev's inequality to bound the left-hand side. Plugging in ε = 0.05, we get
  Pr[|A_n - p| >= 0.05] <= Var(A_n)/0.05^2 = p(1-p)/(0.0025n).
Unfortunately, we don't know p, since that is what we are trying to estimate. But we can place an upper bound on p(1-p):
  p(1-p) <= 1/4,
so
  Pr[|A_n - p| >= 0.05] <= p(1-p)/(0.0025n) <= 0.25/(0.0025n) = 100/n.
Now we wanted Pr[|A_n - p| >= 0.05] <= 0.05, so it suffices that
  100/n <= 0.05, i.e. n >= 2000.
Thus, if we poll at least 2000 random Californians, we can be 95% sure that our estimate of p is within 0.05 of the actual value.

We can repeat the above procedure in the general case, and we will find that it is enough for the sample size n to satisfy
  n >= σ^2/(ε^2 δ),
where σ^2 = Var(X_i). In practice, we would use an upper bound on σ^2, since we don't know its exact value.
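As a quick sanity check on the arithmetic, here is a minimal Python sketch (added for illustration, not part of the original notes) that computes the Chebyshev-based sample size n >= σ^2/(ε^2 δ), using the worst-case bound p(1-p) <= 1/4 for indicator samples.

# Smallest n guaranteeing Pr[|A_n - p| >= eps] <= delta via Chebyshev,
# given an upper bound var_bound on Var(X_i).
import math

def chebyshev_sample_size(eps, delta, var_bound=0.25):
    """Smallest n with var_bound / (n * eps^2) <= delta."""
    return math.ceil(var_bound / (eps**2 * delta))

print(chebyshev_sample_size(0.05, 0.05))   # 2000, matching the calculation above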
The Law of Large Numbers

As we noted above, Var(A_n) decreases as the sample size increases. This suggests that A_n converges to its expected value as n increases. A_n is the average of the random variables X_i, and in general, if we make many observations of i.i.d. random variables, then their average converges to the expected value. We formalize this as the "law of large numbers."

Let X_1, ..., X_n be i.i.d. random variables with expectation μ = E(X_i). Let A_n = (1/n)(X_1 + ... + X_n). Then for any α > 0, we have
  Pr[|A_n - μ| >= α] -> 0 as n -> ∞.

Proof: Let σ^2 = Var(X_i). Then E(A_n) = μ and Var(A_n) = σ^2/n. By Chebyshev's inequality, we have
  Pr[|A_n - μ| >= α] <= Var(A_n)/α^2 = σ^2/(nα^2),
which goes to 0 as n -> ∞.

Thus, as the number of samples increases, the law of large numbers tells us that the deviation of their average A_n from the mean μ tends to zero. Note that the law of large numbers does not tell us that the deviation of their sum S_n = X_1 + ... + X_n from nμ tends to zero! For example, if we recall our coin-flipping game, then X_i = ±1, and S_n is the amount we win after playing n rounds. While the average amount A_n we win per round tends to zero as we play more rounds, the total amount S_n does not. In fact, the Chebyshev bound on S_n diverges as n becomes large, since Var(S_n) = nσ^2.

The belief that S_n converges to its expectation is known as the "gambler's fallacy." It manifests itself in gamblers thinking that if they see a run of heads when flipping a fair coin, for example, correspondingly more tails will show up to even things out. You will see the gambler's fallacy in many other areas as well. For example, if a baseball player is having a cold streak at the plate, it is often said that he is "due" for a hit. This is false, of course, if we take at-bats to be independent. In fact, if we take psychology into account, it is probably more likely that the cold streak will continue. The lesson here is that the law of large numbers applies to the average of many observations, but not to their sum.
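The distinction between the average and the sum is easy to see numerically. The following Python sketch (illustrative only, not from the original notes; it assumes numpy) simulates the ±1 coin-flipping game and prints the running average A_n and running sum S_n at a few values of n.

# Simulate n i.i.d. +/-1 rounds: A_n concentrates near 0, S_n keeps wandering.
import numpy as np

rng = np.random.default_rng(1)
steps = rng.choice([-1, 1], size=1_000_000)   # i.i.d. X_i with E(X_i) = 0
S = np.cumsum(steps)                          # S_n = X_1 + ... + X_n
n = np.arange(1, len(steps) + 1)
A = S / n                                     # A_n = S_n / n

for k in (10**2, 10**4, 10**6):
    print(f"n={k:>8}:  A_n = {A[k-1]:+.4f}   S_n = {int(S[k-1]):+d}")
# A_n shrinks toward 0, while |S_n| typically grows on the order of sqrt(n).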