Administrative info

HW7 out, due Monday.
MT2 next Tuesday; same location and policies as MT1. Covers material through polling/LLN (Wednesday).
Review session Monday 6:30-8 in 320 Soda.
Exams will be available for pickup from the Soda front office.

Review

Recall that we defined independence for random variables as follows. Let X and Y be random variables, A the set of values that X can take on, and B the set of values that Y can take on. Then X and Y are independent if

    ∀a∈A ∀b∈B . Pr[X=a ∩ Y=b] = Pr[X=a] Pr[Y=b].

Equivalently, X and Y are independent if

    ∀a∈A ∀b∈B . Pr[X=a|Y=b] = Pr[X=a].

We can similarly define mutual independence for more than two random variables.

Independent Random Variables

We have used the fact that Var(X+Y) = Var(X) + Var(Y) for independent random variables X and Y. We now prove that fact.

For independent random variables X and Y, we have E(XY) = E(X)E(Y).

Proof:

    E(XY) = ∑_{a,b} ab Pr[X=a ∩ Y=b]
          = ∑_{a,b} ab Pr[X=a] Pr[Y=b]          (independence)
          = ∑_a ∑_b ab Pr[X=a] Pr[Y=b]
          = (∑_a a Pr[X=a]) (∑_b b Pr[Y=b])
          = E(X) E(Y).

We have already claimed that for independent random variables X and Y, Var(X+Y) = Var(X) + Var(Y). Now we prove it.

    Var(X+Y) = E((X+Y)^2) - E(X+Y)^2
             = E(X^2 + 2XY + Y^2) - (E(X) + E(Y))^2
             = E(X^2) + 2E(XY) + E(Y^2) - E(X)^2 - 2E(X)E(Y) - E(Y)^2
             = E(X^2) + 2E(X)E(Y) + E(Y^2) - E(X)^2 - 2E(X)E(Y) - E(Y)^2   (E(XY) = E(X)E(Y) by independence)
             = E(X^2) - E(X)^2 + E(Y^2) - E(Y)^2
             = Var(X) + Var(Y).

What if X and Y are not independent? Let's consider the extreme case Y = X. Then

    E(XY) = E(XX) = E(X^2) ≠ E(X)^2 in general,
    Var(X+Y) = Var(2X) = 4Var(X) ≠ 2Var(X).

So the two facts above do not hold if X and Y are not independent.

Joint Distribution

The fact that two random variables X and Y are independent gives us all the information we need in order to perform calculations involving those two variables. We now turn our attention to the general case, when X and Y are not independent. Recall that for non-independent events A and B, in order to compute probabilities such as Pr[A ∪ B] and Pr[A|B], we needed to know Pr[A ∩ B]. This quantity encodes all the information about how A and B are correlated. For random variables, we need something similar: the values Pr[X = a ∩ Y = b] for all values a that X can take on and all values b that Y can take on. This set of probabilities is called the "joint distribution" of X and Y. Since we will now be making quite heavy use of intersections, and writing ∩ everywhere is tedious, we use a comma instead to denote intersection: Pr[X = a, Y = b].

Before we continue with examples of joint distributions, we first generalize our definition of random variables. Previously, we defined a random variable X on a probability space Ω as a function from Ω to the real numbers, X : Ω -> R, so X(ω) is a real number for all ω ∈ Ω. We can generalize the range of a random variable to be any set S: X : Ω -> S. For example, in PA3, we went to the trouble of defining a numerical value for winning, drawing, and losing, and similarly for each type of hand, in order to define the random variables W and O. Instead, we could have defined the range of W to be the set S = {Lose, Draw, Win}. Then W would assign an element of S to each outcome ω:

    ∀ ω ∈ Ω . W(ω) ∈ S.

If the range of a random variable is not a subset of R, however, we cannot talk about its expectation or variance. (In the case of PA3, the members of S are ordered, so it made sense to use numerical values so that we can compute expectation and variance, which give us information about how much money we expect to win or lose. Similarly, the types of hands are ordered as well.)
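As a quick sanity check of the two facts proved in the review above, here is a small Python sketch (an illustration, not part of the original notes). It enumerates an assumed toy joint distribution of two independent random variables; the particular values and probabilities are made up, and the joint distribution is constructed as the product of the marginals, so X and Y are independent by definition.

    # Assumed toy marginals: X takes values 1, 2; Y takes values 0, 3.
    px = {1: 0.4, 2: 0.6}
    py = {0: 0.5, 3: 0.5}

    # Joint distribution of independent X and Y: Pr[X=a, Y=b] = Pr[X=a] Pr[Y=b].
    joint = {(a, b): px[a] * py[b] for a in px for b in py}

    def expectation(f):
        """E(f(X, Y)) computed by summing over the joint distribution."""
        return sum(f(a, b) * p for (a, b), p in joint.items())

    EX  = expectation(lambda a, b: a)
    EY  = expectation(lambda a, b: b)
    EXY = expectation(lambda a, b: a * b)
    varX  = expectation(lambda a, b: a * a) - EX ** 2
    varY  = expectation(lambda a, b: b * b) - EY ** 2
    varXY = expectation(lambda a, b: (a + b) ** 2) - (EX + EY) ** 2  # Var(X + Y)

    print(abs(EXY - EX * EY) < 1e-9)          # True: E(XY) = E(X)E(Y)
    print(abs(varXY - (varX + varY)) < 1e-9)  # True: Var(X+Y) = Var(X) + Var(Y)

Running the same enumeration on a dependent joint distribution (for example, one concentrated on the diagonal Y = X) makes both checks fail, matching the Y = X example above.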
We now turn to examples of joint distributions using generalized random variables. Suppose we are trying to diagnose a rare ailment (say neurocysticercosis, or something else you would see on House, M.D.) according to the severity of a particular symptom. Let X be a random variable that is 1 if the patient has the disease and 0 otherwise, and let Y be a random variable that takes on one of the values {none, moderate, severe}. Then the joint distribution of X and Y might be the following:

    Pr[X=a,Y=b]                Y
                    none   moderate   severe
    X       0       0.72     0.18      0.00
            1       0.02     0.05      0.03

In other words, Pr[X = 0, Y = none] = 0.72, so 72% of patients have neither the disease nor any symptoms. The table gives us all values of Pr[X = a, Y = b], so it completely specifies the joint distribution of X and Y. The joint distribution gives us all of the information about X and Y.

Since random variables partition the sample space, we can use the total probability rule to obtain "marginal" distributions for X and Y:

    Pr[X = 0] = Pr[X = 0, Y = none] + Pr[X = 0, Y = moderate] + Pr[X = 0, Y = severe] = 0.90.

We can do this for all values of X and Y by adding the values in the appropriate row or column of the table:

    Pr[X=a,Y=b]                Y
                    none   moderate   severe  |  Pr[X = a]
    X       0       0.72     0.18      0.00   |    0.90
            1       0.02     0.05      0.03   |    0.10
            ---------------------------------'
      Pr[Y = b]     0.74     0.23      0.03

This implies that 10% of all patients have the disease, and 3% of all patients have severe symptoms. For independent random variables Q and R, we have Pr[Q = q, R = r] = Pr[Q = q] Pr[R = r], so the joint distribution is the product of the marginals when the random variables are independent.

Recall that Y = b is just an event, so we can compute conditional probabilities given the event Y = b:

    Pr[X = 0 | Y = b] = Pr[X = 0, Y = b] / Pr[Y = b]
    Pr[X = 1 | Y = b] = Pr[X = 1, Y = b] / Pr[Y = b]

This set of probabilities is called the "conditional distribution" of X given Y = b. We can write the conditional distributions Pr[X = a | Y = b] in table form as well:

    Pr[X=a|Y=b]                Y
                    none   moderate   severe
    X       0       0.97     0.78      0.00
            1       0.03     0.22      1.00

We can similarly compute the conditional distributions Pr[Y = b | X = a]:

    Pr[Y=b|X=a]                Y
                    none   moderate   severe
    X       0       0.80     0.20      0.00
            1       0.20     0.50      0.30

Like unconditional distributions, conditional distributions must sum to 1. Recall how we defined conditional probability: Pr[A|B] is the probability of A in a new sample space Ω' given by Ω' = B. So conditioning merely defines a new sample space, and all the rules of probability must hold in this new sample space.

Let's do another example. Suppose Victor and I play rock-paper-scissors in order to determine who gets to grade problem 1 on midterm 2. Let X be my choice of weapon and Y be Victor's choice. Then the joint distribution of X and Y might be:

    Pr[X=a,Y=b]                  Y
                      rock   paper   scissors  |  Pr[X = a]
    X     rock        0.12    0.12     0.16    |    0.4
          paper       0.09    0.09     0.12    |    0.3
          scissors    0.09    0.09     0.12    |    0.3
          ------------------------------------'
        Pr[Y = b]     0.3     0.3      0.4

As you can see, I am slightly biased towards rock, and Victor is slightly biased towards scissors. What is my probability of beating Victor and getting an easy problem to grade? Let W be 1 if I win, 0 if we draw, and -1 if I lose. Then we can compute the joint distribution of X and W.
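Before deriving it by hand, here is a short Python sketch (an illustration, not part of the original notes) that computes the joint distribution of X and W mechanically from the joint distribution of X and Y above, by grouping the entries of the table according to who wins.

    # Joint distribution Pr[X=a, Y=b] from the table above.
    joint_XY = {
        ('rock', 'rock'): 0.12, ('rock', 'paper'): 0.12, ('rock', 'scissors'): 0.16,
        ('paper', 'rock'): 0.09, ('paper', 'paper'): 0.09, ('paper', 'scissors'): 0.12,
        ('scissors', 'rock'): 0.09, ('scissors', 'paper'): 0.09, ('scissors', 'scissors'): 0.12,
    }

    BEATS = {'rock': 'scissors', 'paper': 'rock', 'scissors': 'paper'}

    def outcome(mine, his):
        """W: +1 if my choice beats his, 0 on a draw, -1 if his beats mine."""
        if mine == his:
            return 0
        return 1 if BEATS[mine] == his else -1

    # Pr[X=a, W=c] = sum over b of Pr[X=a, Y=b] for which outcome(a, b) = c.
    joint_XW = {}
    for (a, b), p in joint_XY.items():
        key = (a, outcome(a, b))
        joint_XW[key] = joint_XW.get(key, 0.0) + p

    for a in ('rock', 'paper', 'scissors'):
        print(a, [round(joint_XW.get((a, c), 0.0), 2) for c in (1, 0, -1)])
    # rock [0.16, 0.12, 0.12]
    # paper [0.09, 0.09, 0.12]
    # scissors [0.09, 0.12, 0.09]

The by-hand derivation that follows arrives at the same table.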
To derive it by hand, note that if I choose rock, I win when he chooses scissors, I lose when he chooses paper, and we draw when he chooses rock. So

    Pr[X = rock, W = 1]  = Pr[X = rock, Y = scissors]
    Pr[X = rock, W = -1] = Pr[X = rock, Y = paper]
    Pr[X = rock, W = 0]  = Pr[X = rock, Y = rock].

Repeating this for all values of X, we get

    Pr[X=a,W=c]                  W
                      +1       0      -1      |  Pr[X = a]
    X     rock        0.16    0.12    0.12    |    0.4
          paper       0.09    0.09    0.12    |    0.3
          scissors    0.09    0.12    0.09    |    0.3
          -----------------------------------'
        Pr[W = c]     0.34    0.33    0.33

So I have a slightly higher probability of winning than losing, which is good news for me. (I get to go home early from grading!) We can also compute the conditional distributions Pr[W=c|X=a], which will tell me that I should choose rock (until he catches on, of course):

    Pr[W=c|X=a]                  W
                      +1       0      -1
    X     rock        0.4     0.3     0.3
          paper       0.3     0.3     0.4
          scissors    0.3     0.4     0.3

As expected, since Victor is biased towards scissors, I am more likely to win if I choose rock and less likely if I choose paper.

Conditional Probability Spaces

Suppose that instead, we bet $1 on the outcome of each game of rock-paper-scissors. Then W is exactly the amount of money I win, and I would like to know how much I can expect to win if I play Victor many times. We can compute this from the marginal distribution of W:

    E(W) = 1 * 0.34 + 0 * 0.33 - 1 * 0.33 = 0.01.

I would also like to know how much I can expect to win for each of my choices. That way, I have a better idea of what I should choose (again, until Victor catches on). So I want to know the "conditional expectation" of W given X = a for each a. Again, conditioning on the event X = a gives us a new sample space, and anything we can do in an arbitrary sample space we can do in this new sample space; we just have to replace all our probabilities with conditional probabilities given by the new sample space. Doing so, we define conditional expectation as follows:

    E(W | E) = ∑_{c∈C} c Pr[W = c | E],

where E is any event and C is the set of all possible values that W can take on. So in this case, we have

    E(W | X = rock)     = 1 * 0.4 + 0 * 0.3 - 1 * 0.3 = 0.1
    E(W | X = paper)    = 1 * 0.3 + 0 * 0.3 - 1 * 0.4 = -0.1
    E(W | X = scissors) = 1 * 0.3 + 0 * 0.4 - 1 * 0.3 = 0

So I expect to win more if I choose rock.

We can obtain the unconditional expectation E(W) from the conditional expectations as follows, where A is the set of values that X can take on:

    E(W) = ∑_{c∈C} c Pr[W = c]
         = ∑_{c∈C} c (∑_{a∈A} Pr[X=a] Pr[W=c|X=a])      (total probability rule)
         = ∑_{a∈A} Pr[X=a] (∑_{c∈C} c Pr[W=c|X=a])
         = ∑_{a∈A} Pr[X=a] E(W|X=a).

This is called the "total expectation law." In this case, we get

    E(W) = E(W | X = rock) Pr[X = rock] + E(W | X = paper) Pr[X = paper] + E(W | X = scissors) Pr[X = scissors]
         = 0.1 * 0.4 + (-0.1) * 0.3 + 0 * 0.3
         = 0.01,

which is the same as our previous answer. In the next discussion section, you will see a neat way of computing the expectation of a geometric random variable using conditional expectation and the total expectation law.
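The following Python sketch (illustrative, not part of the original notes) recomputes the conditional expectations E(W | X = a) directly from the joint distribution of X and W given above, and confirms that the total expectation law reproduces E(W) = 0.01.

    # Joint distribution Pr[X=a, W=c] from the table above.
    joint_XW = {
        ('rock', 1): 0.16, ('rock', 0): 0.12, ('rock', -1): 0.12,
        ('paper', 1): 0.09, ('paper', 0): 0.09, ('paper', -1): 0.12,
        ('scissors', 1): 0.09, ('scissors', 0): 0.12, ('scissors', -1): 0.09,
    }

    # Marginal Pr[X = a], obtained by summing out W (total probability rule).
    pX = {}
    for (a, c), p in joint_XW.items():
        pX[a] = pX.get(a, 0.0) + p

    # Conditional expectation E(W | X = a) = sum_c c * Pr[W = c | X = a].
    def cond_exp_W(a):
        return sum(c * p / pX[a] for (x, c), p in joint_XW.items() if x == a)

    for a in ('rock', 'paper', 'scissors'):
        print(a, round(cond_exp_W(a), 2))   # 0.1, -0.1, 0.0

    # Total expectation law: E(W) = sum_a Pr[X = a] * E(W | X = a).
    EW_direct = sum(c * p for (a, c), p in joint_XW.items())
    EW_total  = sum(pX[a] * cond_exp_W(a) for a in pX)
    print(round(EW_direct, 2), round(EW_total, 2))   # both 0.01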
We can also define "conditional independence" for events A and B. A and B are independent conditional on C if

    Pr[A, B | C] = Pr[A|C] Pr[B|C].

Equivalently, A and B are independent given C if

    Pr[A | B, C] = Pr[A|C].

This tells us that if we are given C, then knowing that B occurred gives us no information about whether or not A occurred. Note that Pr[A|B|C] is meaningless, since what is to the right of the bar determines our sample space. So in order to condition on two events B and C, we condition on their intersection.

We can similarly define conditional independence for random variables. There are conditional versions of other probability rules as well:

(1) inclusion/exclusion:

    Pr[A∪B|C] = Pr[A|C] + Pr[B|C] - Pr[A,B|C]

(2) total probability rule:

    Pr[A|C] = Pr[A,B|C] + Pr[A,¬B|C]
            = Pr[A|B,C] Pr[B|C] + Pr[A|¬B,C] Pr[¬B|C]

(3) Bayes' rule:

    Pr[A|B,C] = Pr[B|A,C] Pr[A|C] / Pr[B|C]

As an exercise, you may wish to prove some of these on your own.
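Before trying the proofs, here is a numerical sanity check: a Python sketch (illustrative only; the outcome weights are made up) that builds a small probability space over three events A, B, C and verifies the three conditional rules above. Since the rules hold for any distribution, the check should pass regardless of the particular weights chosen.

    # Assumed toy probability space: outcomes are triples (a, b, c) indicating
    # whether events A, B, C occur; the weights are made up and sum to 1.
    weights = {
        (0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
        (1, 0, 0): 0.10, (1, 0, 1): 0.10, (1, 1, 0): 0.05, (1, 1, 1): 0.25,
    }

    def pr(event):
        """Probability of an event given as a predicate on outcomes."""
        return sum(p for w, p in weights.items() if event(w))

    A = lambda w: w[0] == 1
    B = lambda w: w[1] == 1
    C = lambda w: w[2] == 1
    notB = lambda w: not B(w)
    both = lambda e, f: (lambda w: e(w) and f(w))
    either = lambda e, f: (lambda w: e(w) or f(w))

    def cond(e, given):
        """Pr[e | given] = Pr[e ∩ given] / Pr[given]."""
        return pr(both(e, given)) / pr(given)

    # (1) Conditional inclusion/exclusion.
    lhs1 = cond(either(A, B), C)
    rhs1 = cond(A, C) + cond(B, C) - cond(both(A, B), C)

    # (2) Conditional total probability rule.
    lhs2 = cond(A, C)
    rhs2 = cond(A, both(B, C)) * cond(B, C) + cond(A, both(notB, C)) * cond(notB, C)

    # (3) Conditional Bayes' rule.
    lhs3 = cond(A, both(B, C))
    rhs3 = cond(B, both(A, C)) * cond(A, C) / cond(B, C)

    print(all(abs(l - r) < 1e-9 for l, r in [(lhs1, rhs1), (lhs2, rhs2), (lhs3, rhs3)]))  # True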