Administrative info

HW7 out, due Monday.
MT2 next Tuesday. Same location and policies as MT1. Covers through polling/LLN (Wednesday).

Review

We have now seen three important distributions.

The first is the binomial distribution. A random variable X ~ Bin(n, p) has the distribution

    Pr[X = i] = C(n, i) p^i (1-p)^(n-i)    for integer i, 0 <= i <= n.

This distribution arises whenever we have a fixed number of trials n, the trials are mutually independent, the probability of success of any one trial is p, and we are counting the number of successes. The expectation of X is E(X) = np.

The second is the geometric distribution. A random variable Y ~ Geom(p) has the distribution

    Pr[Y = i] = p(1-p)^(i-1)    for i ∈ Z^+.

This distribution arises whenever we have independent trials, the probability of success of any one trial is p, and we are interested in the first success. The expectation of Y is E(Y) = 1/p.

The third is the Poisson distribution. A random variable Z ~ Poiss(λ) has the distribution

    Pr[Z = i] = (λ^i / i!) e^{-λ}    for i ∈ N.

This distribution is the limit of the binomial distribution when n is large and p is small. It is used to model the occurrence of rare events. The expectation of Z is E(Z) = λ.

Poisson Distribution

The Poisson distribution is widely used for modeling rare events. It is a good approximation of the binomial distribution when n >= 20 and p <= 0.05, and a very good approximation when n >= 100 and np <= 10.

EX: Suppose a web server gets an average of 100K requests a day. Each request takes 1 second to handle. How many servers are needed to handle requests?

ANS: The website has an unknown number of customers n, and there is a tiny probability p of each person making a request in any 1 second time period. Thus, the rare event is a person choosing to make a request, and we can use the Poisson distribution to model this situation. (We don't actually know n or p, so we couldn't use the binomial distribution even if we wanted to.) Since there are 100K requests a day on average, the average number of requests in a 1 second time period is λ = 100000/(24*3600) ≈ 1.2. Unlike n and p, this can be measured directly, allowing us to use the Poisson distribution.

Let R be the number of requests in a 1 second period. Then R ~ Poiss(1.2), and Pr[R = i] = (λ^i / i!) e^{-λ}. Plugging in λ = 1.2, we get the following values:

    i    Pr[R = i]    Pr[R <= i]
    0    0.301        0.301
    1    0.361        0.662
    2    0.217        0.879
    3    0.087        0.966
    4    0.026        0.992
    5    0.006        0.999

So if we have 5 servers, we can handle all requests without overloading the servers 99.9% of the time.

(Note that we assumed a uniform distribution of requests over the entire day. If this is not the case, we can measure the average number of requests in the busiest 1 second time period and use this as λ. The rest of our analysis will be the same.)
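For the curious, the table is easy to reproduce by computing the Poisson pmf directly. Here is a minimal Python sketch; the helper name poisson_pmf is arbitrary, and λ is rounded to 1.2 to match the table (cumulative values may differ from the table in the last digit because the table rounds each entry first).

    from math import exp, factorial

    lam = 1.2  # ≈ 100000 / (24 * 3600) requests per second, rounded as in the lecture

    def poisson_pmf(i, lam):
        # Pr[R = i] for R ~ Poiss(lam)
        return (lam ** i) / factorial(i) * exp(-lam)

    cumulative = 0.0
    for i in range(6):
        p = poisson_pmf(i, lam)
        cumulative += p
        print(f"{i}   Pr[R = {i}] = {p:.3f}   Pr[R <= {i}] = {cumulative:.3f}")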
Variance

Consider a random walk: I flip a (fair) coin, and if it is heads, I take a step to the right, but if it is tails, I take a step to the left. (This models many situations: a drunken sailor, the value of my stock account, our coin flipping game from a previous lecture.) How far from the starting point can I expect to be after n flips?

Let X_i be a random variable (not an indicator r.v.!) that is +1 if the ith flip is heads and -1 if it is tails. Let Y be my position after n flips. Then

    X_i = { +1 with pr. 1/2, -1 with pr. 1/2 }
    Y = X_1 + ... + X_n

What is E(Y)? We have

    E(X_i) = 0
    E(Y) = E(X_1) + ... + E(X_n) = 0.

So I can expect to be back where I started. This isn't, however, exactly what the question asked. We wanted to know our distance from the starting point, not where we end up. What we actually want to know is E(|Y|). Unfortunately, the random variable |Y| is difficult to work with. So let's work with Y^2 instead, which is always nonnegative. Then we will take a square root at the end to learn something about how far we typically are from the starting point.

(Note that it is not true that √{E(Z^2)} = E(|Z|). As a simple counterexample, consider an indicator random variable Z with Pr[Z = 1] = p. Then E(|Z|) = E(Z) = p, but E(Z^2) = p, so √{E(Z^2)} = √p ≠ E(|Z|) whenever 0 < p < 1. We will see later how to relate |Z| and Z^2.)

We have

    E(Y^2) = E((X_1 + ... + X_n)^2) = E(∑_{i,j} X_i X_j) = ∑_{i,j} E(X_i X_j).

In the above summations, i and j are in the range 1 <= i,j <= n, so there are n^2 terms. What is E(X_i X_j)? There are two cases.

(1) i = j. Then E(X_i X_j) = E(X_i^2) = 1, since X_i^2 is always 1.

(2) i ≠ j. Let's enumerate the possibilities for X_i X_j:

    X_i    X_j    X_i X_j    prob.
     1      1        1       1/4
     1     -1       -1       1/4
    -1      1       -1       1/4
    -1     -1        1       1/4

In the last column, we used the fact that different coin flips are independent, so the events X_i = a and X_j = b are independent. Putting this together, we get

    Pr[X_i X_j = 1] = 1/2
    Pr[X_i X_j = -1] = 1/2
    E(X_i X_j) = 0.

In our summation, there are n terms that fall under case (1) and n^2 - n that fall under case (2), so we get

    E(Y^2) = n * 1 + (n^2 - n) * 0 = n.

This is called the "variance" of Y (here E(Y^2) is the variance because E(Y) = 0), and it tells us something about the spread of the random variable Y. More generally, for a random variable Z with arbitrary expectation E(Z) = μ, we define the variance to be

    Var(Z) = E((Z - μ)^2).

It tells us something about the deviation of Z from its mean. The "standard deviation" of Z is σ(Z) = √{Var(Z)}, which in some sense undoes the square in the variance. (Why do we have both variance and standard deviation? Variance is easier to work with, but standard deviation is on the same scale as the random variable, so it gives us a better idea about the typical deviation from the mean.) In the random walk, σ(Y) = √n.

An alternative expression for the variance is

    Var(X) = E(X^2) - μ^2.

Proof:

    Var(X) = E((X - μ)^2)
           = E(X^2 - 2Xμ + μ^2)
           = E(X^2) - 2μE(X) + μ^2
           = E(X^2) - 2μ^2 + μ^2
           = E(X^2) - μ^2.

In the third step above, we used linearity of expectation.

Let's do some more examples.

Uniform distribution

Let X be a random variable with uniform distribution on 1, ..., n. Then

    μ = E(X) = (1/n)(1 + ... + n) = (1/n) · n(n+1)/2 = (n+1)/2
    μ^2 = (n+1)^2/4 = 3(n+1)^2/12

    E(X^2) = (1/n)(1 + 4 + ... + n^2) = (1/n) ∑_{i=1}^n i^2 = (1/n) · n(n+1)(2n+1)/6
           = (n+1)(2n+1)/6 = 2(n+1)(2n+1)/12

    Var(X) = E(X^2) - μ^2 = 2(n+1)(2n+1)/12 - 3(n+1)^2/12
           = ((n+1)/12)(4n+2 - 3n-3) = (n+1)(n-1)/12 = (n^2-1)/12.

Compare this variance to that of the random walk; this is on the order of n^2, while that of the random walk was on the order of n. This should make sense, since in the case of the random walk, it's much more likely to be closer to the mean than further away, unlike in a uniform distribution. (The probability "mass" of the random walk is concentrated near the mean, while in a uniform distribution, it is spread out.)

EX: Let X be the result of a roll of a fair die. What is Var(X)?
ANS: Var(X) = (6^2 - 1)/12 = 35/12, and σ(X) ≈ 1.7.
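As a quick sanity check of the closed-form formula, here is a minimal Python sketch that computes Var(X) for a fair die directly from the definition E((X - μ)^2) and compares it with (n^2 - 1)/12. The variable names are just illustrative.

    n = 6
    values = range(1, n + 1)

    mu = sum(values) / n                           # E(X) = (n+1)/2 = 3.5
    var = sum((x - mu) ** 2 for x in values) / n   # E((X - mu)^2), each outcome has prob. 1/n

    print(var)                 # 2.9166... = 35/12
    print((n**2 - 1) / 12)     # same value from the closed-form formula
    print(var ** 0.5)          # sigma(X) ≈ 1.7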
Binomial distribution

Let X ~ Bin(n, p). We proceed as in the random walk. Let X_i be an indicator random variable that is 1 if the ith trial succeeds. Then

    X = X_1 + ... + X_n
    E(X^2) = E(∑_{i,j} X_i X_j) = ∑_{i,j} E(X_i X_j).

For E(X_i X_j), we again have two cases.

(1) i = j. Then E(X_i X_j) = E(X_i^2) = p, since Pr[X_i^2 = 1] = p.

(2) i ≠ j. Let's enumerate the possibilities for X_i X_j:

    X_i    X_j    X_i X_j    prob.
     1      1        1       p^2
     1      0        0       p(1-p)
     0      1        0       p(1-p)
     0      0        0       (1-p)^2

In the last column, we used the fact that different trials are independent. Thus, Pr[X_i X_j = 1] = p^2, and E(X_i X_j) = p^2.

There are n terms in the summation that fall under case (1) and n^2 - n that fall under case (2), so we get

    E(X^2) = np + (n^2 - n)p^2 = np + n^2 p^2 - np^2 = n^2 p^2 + np(1-p).

Then

    Var(X) = E(X^2) - E(X)^2 = n^2 p^2 + np(1-p) - n^2 p^2 = np(1-p).

Geometric distribution

Let X ~ Geom(p). Then

    E(X^2) = p + 4p(1-p) + 9p(1-p)^2 + 16p(1-p)^3 + ...

Multiplying this by (1-p), we get

    (1-p)E(X^2) = p(1-p) + 4p(1-p)^2 + 9p(1-p)^3 + ...

Subtracting, we get

    pE(X^2) = p + 3p(1-p) + 5p(1-p)^2 + 7p(1-p)^3 + ...
            = 2[p + 2p(1-p) + 3p(1-p)^2 + 4p(1-p)^3 + ...] - [p + p(1-p) + p(1-p)^2 + p(1-p)^3 + ...]

The first bracketed sum is just E(X), and the second is the sum of the probabilities of all of the outcomes, so it is 1. Thus,

    pE(X^2) = 2E(X) - 1 = 2/p - 1 = (2-p)/p
    E(X^2) = (2-p)/p^2.

Then

    Var(X) = E(X^2) - E(X)^2 = (2-p)/p^2 - 1/p^2 = (1-p)/p^2.

Here are some useful facts about variance.

(1) Var(cX) = c^2 Var(X), where c is a constant.

(2) Var(X+c) = Var(X), where c is a constant.

EX: In the random walk, we have Y = 2H - n, where H is the number of heads. (We showed this in a previous lecture.) So Var(Y) = 4 Var(H). We've computed Var(H) = np(1-p), so Var(Y) = 4np(1-p). For a fair coin, p = 1/2, and we get Var(Y) = 4n(1/2)(1/2) = n, as before.

(3) Var(X+Y) = Var(X) + Var(Y) if X and Y are independent.

What does it mean for two random variables X and Y to be independent? Let A be the set of values that X can take on and B be the set of values that Y can take on. Then X and Y are independent if

    ∀a∈A ∀b∈B . Pr[X=a ∩ Y=b] = Pr[X=a] Pr[Y=b].

EX: Let X ~ Bin(n, p) and define indicator random variables X_i as before. Then E(X_i^2) = p, so Var(X_i) = p - p^2 = p(1-p). Since the X_i are independent,

    Var(X) = Var(X_1) + ... + Var(X_n) = np(1-p),

as before.

The proofs of (1) and (2) are straightforward from the definition of variance. We will come back to (3) later.
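To build some confidence in the formulas np(1-p) and (1-p)/p^2, here is a minimal simulation sketch in Python. The sampler names and the choices n = 20, p = 0.3, and 100,000 samples are arbitrary illustrations, not part of the lecture.

    import random

    def sample_binomial(n, p):
        # number of successes in n independent trials with success probability p
        return sum(1 for _ in range(n) if random.random() < p)

    def sample_geometric(p):
        # number of independent trials up to and including the first success
        trials = 1
        while random.random() >= p:
            trials += 1
        return trials

    def empirical_variance(samples):
        # mirrors the definition Var(X) = E((X - mu)^2)
        mu = sum(samples) / len(samples)
        return sum((x - mu) ** 2 for x in samples) / len(samples)

    n, p, num_samples = 20, 0.3, 100_000
    bin_samples = [sample_binomial(n, p) for _ in range(num_samples)]
    geo_samples = [sample_geometric(p) for _ in range(num_samples)]

    print(empirical_variance(bin_samples), n * p * (1 - p))    # both ≈ 4.2
    print(empirical_variance(geo_samples), (1 - p) / p ** 2)   # both ≈ 7.8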
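Fact (3) is the one we are deferring, but an empirical check is easy. Here is another minimal sketch, this time summing two independent fair die rolls; the sample size is again an arbitrary choice, and the standard-library pvariance is used as the population variance.

    import random
    from statistics import pvariance   # population variance of a sample

    num_samples = 100_000
    xs = [random.randint(1, 6) for _ in range(num_samples)]   # X uniform on {1,...,6}
    ys = [random.randint(1, 6) for _ in range(num_samples)]   # Y, an independent roll
    sums = [x + y for x, y in zip(xs, ys)]                    # X + Y

    print(pvariance(xs), pvariance(ys))   # each ≈ 35/12 ≈ 2.92
    print(pvariance(sums))                # ≈ 2 * 35/12 ≈ 5.83, matching Var(X) + Var(Y)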