Administrative info

- MT2 tomorrow; same location and policies as MT1; covers material through polling/LLN (Wednesday)
- Review session today, 6:30-8 in 320 Soda
- Exams can be picked up from the Soda front office
- PA3 due Thursday

Review

Last time, we defined the joint distribution of two random variables X, Y as the set of probabilities Pr[X = a, Y = b] for all possible values of a and b. The set of vanilla probabilities Pr[X = a] is the marginal distribution of X, and the set of conditional probabilities Pr[X = a | E] given an event E is the conditional distribution of X given E.

We then defined conditional expectation given E:

E(X | E) = ∑_{a ∈ A} a · Pr[X = a | E],

where A is the set of all possible values that X can take on. The total expectation law is the expectation analogue of the total probability rule: given a random variable Y (or any partition of the sample space), we have

E(X) = ∑_{b ∈ B} Pr[Y = b] E(X | Y = b).

We then defined conditional independence for events A and B. A and B are independent conditional on C if Pr[A, B | C] = Pr[A | C] Pr[B | C]. Equivalently, A and B are independent given C if Pr[A | B, C] = Pr[A | C]. This tells us that if we are given C, then knowing that B occurred gives us no information about whether or not A occurred. Note that Pr[A | B | C] is meaningless, since what is to the right of the bar determines our sample space; in order to condition on two events B and C, we condition on their intersection. We can similarly define conditional independence for random variables.

There are conditional versions of other probability rules as well:

(1) inclusion/exclusion: Pr[A ∪ B | C] = Pr[A | C] + Pr[B | C] - Pr[A, B | C]
(2) total probability rule: Pr[A | C] = Pr[A, B | C] + Pr[A, B̄ | C] = Pr[A | B, C] Pr[B | C] + Pr[A | B̄, C] Pr[B̄ | C]
(3) Bayes' rule: Pr[A | B, C] = Pr[B | A, C] Pr[A | C] / Pr[B | C]

As an exercise, you may wish to prove some of these on your own. We now return to Bayesian inference, using the new probability tools we have learned.

Inference

Suppose I have a friend who challenges me to a game of flipping coins, where I win $1 if I correctly guess the outcome but lose $1 if I don't. I know that my friend has been to a website that sells n different trick coins, each with a different probability p_i of heads, but I'm not sure which of the coins he purchased. What should I guess for the first flip? What should I guess for the second flip if the first one comes up heads? The third flip if the first two are both heads?

Let X be a random variable that is i if the coin my friend is using is the ith coin on the website. Since I have no idea which coin he is using, the "prior" distribution of X is

Pr[X = i] = 1/n,  i ∈ {1, 2, ..., n}.

Our goal is to refine these probabilities based on our observations. This is the problem of "inference," where we attempt to determine a hidden quantity by making observations of events that are related to the hidden quantity.

Let Y_j be a random variable that is H if the jth flip is heads and T otherwise. These are the observables, and we have Pr[Y_j = H | X = i] = p_i. We also note that the Y_j are conditionally independent given X, i.e. if we know that coin i is being used, then every flip has probability p_i of heads, independent of all other flips.

Let's compute the probability that the first flip is heads, i.e. Pr[Y_1 = H]. By the total probability rule,

Pr[Y_1 = H] = ∑_{i=1}^n Pr[Y_1 = H | X = i] Pr[X = i] = ∑_{i=1}^n p_i/n = (1/n) ∑_{i=1}^n p_i.

This answers our first question: if Pr[Y_1 = H] = (1/n) ∑_{i=1}^n p_i > 1/2, I should bet on heads; otherwise tails.
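To make the first bet concrete, here is a minimal Python sketch that evaluates Pr[Y_1 = H] = (1/n) ∑_i p_i under the uniform prior and picks the better guess. The list ps of heads probabilities is a made-up example inventory for illustration, not data from the notes.

```python
from fractions import Fraction

def prob_first_heads(ps):
    """Pr[Y_1 = H] under a uniform prior over the n coins: (1/n) * sum_i p_i."""
    n = len(ps)
    return sum(ps, Fraction(0)) / n

# Hypothetical website inventory: three trick coins with these heads probabilities.
ps = [Fraction(3, 4), Fraction(1, 2), Fraction(1, 5)]

p_heads = prob_first_heads(ps)
print(p_heads)                                              # 29/60
print("bet heads" if p_heads > Fraction(1, 2) else "bet tails")
```

With this particular inventory the average heads probability is 29/60 < 1/2, so the better first guess is tails.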
Now if I see heads on the first flip, that changes the probabilities of each of the coins. In particular, if one of the coins has tails on both sides, then I know that my friend is not using that coin. More generally, we want the "posterior" distribution of X given the observation that the first flip is heads: Pr[X = i | Y_1 = H]. By Bayes' rule, we have

Pr[X = i | Y_1 = H] = Pr[Y_1 = H | X = i] Pr[X = i] / Pr[Y_1 = H] = (p_i · 1/n) / ((1/n) ∑_{k=1}^n p_k) = p_i / ∑_{k=1}^n p_k.

Now in order to answer our second question, we need to compute the probability of getting heads on the second flip, given that the first flip was heads: Pr[Y_2 = H | Y_1 = H]. We follow the same procedure as in computing Pr[Y_1 = H], except that we use the conditional version of the total probability rule given Y_1 = H:

Pr[Y_2 = H | Y_1 = H] = ∑_{i=1}^n Pr[Y_2 = H | Y_1 = H, X = i] Pr[X = i | Y_1 = H].

Here, we note that as mentioned before, Y_1 and Y_2 are conditionally independent given X, so Pr[Y_2 = H | Y_1 = H, X = i] = Pr[Y_2 = H | X = i] = p_i. Plugging this in, we get

Pr[Y_2 = H | Y_1 = H] = ∑_{i=1}^n (p_i · p_i / ∑_{k=1}^n p_k) = (∑_{i=1}^n p_i^2) / (∑_{i=1}^n p_i).

Now if this quantity is greater than 1/2, I should bet on heads. We can continue this procedure as we see more observations and learn more information about which coin is being used. First, we use the conditional version of Bayes' rule to compute

Pr[X = i | Y_1 = H, Y_2 = H] = Pr[Y_2 = H | X = i, Y_1 = H] Pr[X = i | Y_1 = H] / Pr[Y_2 = H | Y_1 = H]
                             = Pr[Y_2 = H | X = i] Pr[X = i | Y_1 = H] / Pr[Y_2 = H | Y_1 = H]   (conditional independence)
                             = [p_i · p_i / ∑_{k=1}^n p_k] / [(∑_{k=1}^n p_k^2) / (∑_{k=1}^n p_k)]
                             = p_i^2 / ∑_{k=1}^n p_k^2.

We can proceed to compute Pr[Y_3 = H | Y_1 = H, Y_2 = H] = (∑_{i=1}^n p_i^3) / (∑_{i=1}^n p_i^2), which would tell us what to bet on for the third flip.

Iterative Update

As you can see, the above procedure requires a lot of work each time we make an observation. We can, however, save ourselves much of this work by computing an update rule that, given any prior distribution, gives us the posterior distribution after making an observation. This can be done in the general case above, but for a simpler example, suppose I've checked my friend's browser history and have learned that he purchased two trick coins, the first with p_1 = 3/4 and the second with p_2 = 1/2 (not a trick coin, but a trick trick coin!). Let's start with an arbitrary prior distribution

Pr[X = 1] = q,  Pr[X = 2] = 1 - q.

Now we compute the probability that the first flip is heads:

Pr[Y_1 = H] = Pr[Y_1 = H | X = 1] Pr[X = 1] + Pr[Y_1 = H | X = 2] Pr[X = 2] = (3/4)q + (1/2)(1 - q) = (1/4)q + 1/2.

Then if the first flip is heads, we compute the posterior distribution:

Pr[X = 1 | Y_1 = H] = Pr[Y_1 = H | X = 1] Pr[X = 1] / Pr[Y_1 = H] = (3/4)q / ((1/4)q + 1/2) = 3q / (q + 2)
Pr[X = 2 | Y_1 = H] = Pr[Y_1 = H | X = 2] Pr[X = 2] / Pr[Y_1 = H] = (1/2)(1 - q) / ((1/4)q + 1/2) = (2 - 2q) / (q + 2).

This gives us the update rule for the distribution of X when we see heads:

(q, 1 - q) -> (3q/(q + 2), (2 - 2q)/(q + 2)).

We can similarly compute an update rule for when we see tails. As a concrete example, suppose we start off with q = 1/2, so the prior distribution is (1/2, 1/2). Then if the first flip is heads, the posterior distribution is

(1/2, 1/2) -> ((3/2)/(5/2), 1/(5/2)) = (3/5, 2/5).

Then if the second flip is heads, the new posterior distribution is

(3/5, 2/5) -> ((9/5)/(13/5), (4/5)/(13/5)) = (9/13, 4/13).

We can continue this process as we make more observations.
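The update rule is easy to express in code. The following Python sketch is my own illustration of one step of the update for an arbitrary finite prior; applied twice to the two-coin example, it reproduces the sequence (1/2, 1/2) -> (3/5, 2/5) -> (9/13, 4/13) computed above.

```python
from fractions import Fraction

def bayes_update(prior, likelihoods):
    """One step of the iterative (recursive Bayesian) update.

    prior[i]       = Pr[X = i] before the observation
    likelihoods[i] = Pr[observation | X = i]
    Returns the posterior Pr[X = i | observation].
    """
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)                       # Pr[observation], by total probability
    return [j / total for j in joint]

# The two-coin example from the notes: p_1 = 3/4, p_2 = 1/2.
heads = [Fraction(3, 4), Fraction(1, 2)]     # Pr[Y_j = H | X = i]
tails = [1 - p for p in heads]               # Pr[Y_j = T | X = i]

dist = [Fraction(1, 2), Fraction(1, 2)]      # prior (q, 1 - q) with q = 1/2
dist = bayes_update(dist, heads)             # -> [3/5, 2/5] after one heads
dist = bayes_update(dist, heads)             # -> [9/13, 4/13] after two heads
print(dist)
```

Passing tails instead of heads gives the corresponding update rule for a tails observation.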
This procedure is called "recursive Bayesian estimation" and lets us incrementally update the distribution as each observation is made.

Likelihood Ratios

Instead of using probabilities, gamblers use odds ratios. For example, rather than expressing the probability of getting four of a kind in a five-card poker hand as 1/4165, they would say that the odds in favor of getting four of a kind are 1:4164 or 1/4164, meaning that it is 1/4164 times as likely to get four of a kind as it is not to get one. Similarly, for a roulette wheel, rather than saying that the probability of black is 9/19, they would say that the odds in favor of black are 9:10 or 9/10. Of course, we can convert between probabilities and odds quite easily: if the probability of an event is p, then the odds in favor are p / (1 - p), and if the odds of an event are a/b, then p = a / (a + b).

Odds, or "likelihood ratios," can occasionally be easier to work with. In the coin flipping example above, if we only care about which of the coins is most likely (to determine whether or not I should accuse my friend of cheating), then we can save ourselves some work by using likelihood ratios. Consider the example where my friend has one of two coins, one with p_1 = 3/4 and the other with p_2 = 1/2. Then the likelihood ratio for the arbitrary prior is

Pr[X = 1] / Pr[X = 2] = q / (1 - q).

If we see heads, we want to compute a new likelihood ratio:

Pr[X = 1 | Y_1 = H] / Pr[X = 2 | Y_1 = H] = (Pr[Y_1 = H | X = 1] Pr[X = 1]) / (Pr[Y_1 = H | X = 2] Pr[X = 2]).

Notice that we no longer have to compute Pr[Y_1 = H], since it cancels out of the ratio. So we get

Pr[X = 1 | Y_1 = H] / Pr[X = 2 | Y_1 = H] = ((3/4)q) / ((1/2)(1 - q)) = (3/2) · q / (1 - q).

Thus, in order to compute a new likelihood ratio, we merely have to multiply the old one by 3/2. To repeat the example above, we start with a likelihood ratio of 1. Then if we see heads, we get 3/2. Then if we see another, we get 9/4. And if we see another, we get 27/8. And if we see one more, we get 81/16, and it's a good bet that my friend is cheating. What if we see tails? We get

Pr[X = 1 | Y_1 = T] / Pr[X = 2 | Y_1 = T] = (Pr[Y_1 = T | X = 1] Pr[X = 1]) / (Pr[Y_1 = T | X = 2] Pr[X = 2]) = ((1/4)q) / ((1/2)(1 - q)) = (1/2) · q / (1 - q).

So we update our likelihood ratio by multiplying by 1/2.
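As a quick check of this multiplicative bookkeeping, here is a short Python sketch (an illustration, not code from the lecture) that tracks the likelihood ratio for the two-coin example through a sequence of flips.

```python
from fractions import Fraction

# Likelihood ratio Pr[X = 1] / Pr[X = 2] for the two-coin example.
# Each heads multiplies it by (3/4)/(1/2) = 3/2; each tails by (1/4)/(1/2) = 1/2.
HEADS_FACTOR = Fraction(3, 4) / Fraction(1, 2)
TAILS_FACTOR = Fraction(1, 4) / Fraction(1, 2)

ratio = Fraction(1)                          # uniform prior, q = 1/2
for flip in "HHHH":                          # four heads in a row, as in the example
    ratio *= HEADS_FACTOR if flip == "H" else TAILS_FACTOR
print(ratio)                                 # 81/16, strong evidence for the 3/4 coin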