Administrative info

PA2 due Monday. HW5 due Tuesday. MT1 grades up, μ ≈ 68.6, σ ≈ 15.8. MT1 solutions will be posted with some more comments.

Review

Recall that the conditional probability of event B given event A is Pr[B|A] = Pr[A ∩ B]/Pr[A]. Also recall that events A and B are independent if Pr[B|A] = Pr[B], or equivalently Pr[A ∩ B] = Pr[A] Pr[B].

Mutual Independence

Recall the coin flipping example from last time. We argued that the probability of getting a particular outcome containing k heads in n flips of a biased coin with probability p of heads is p^k (1-p)^(n-k). Let's formalize this argument. Let A_i be the event that the ith flip comes up heads, and write ¬A_i for its complement, the event that the ith flip comes up tails. It's reasonable to conclude that Pr[A_i] = p. Now suppose we have a particular outcome with k heads, such as the outcome ω = H^k T^(n-k) where the first k flips are heads and the rest are tails. This is the sole outcome in the event E = A_1 ∩ ... ∩ A_k ∩ ¬A_{k+1} ∩ ... ∩ ¬A_n, so Pr[ω] = Pr[E]. We've already argued that Pr[A_i] = p and Pr[¬A_i] = 1-p. How do we compute the probability of the intersection Pr[E]?

We know from the definition of conditional probability that Pr[B|A] = Pr[A ∩ B]/Pr[A], so Pr[A ∩ B] = Pr[B|A] Pr[A]. We can generalize this to a product rule for n events A_1, ..., A_n:

Pr[A_1 ∩ ... ∩ A_n] = Pr[A_1] * Pr[A_2|A_1] * Pr[A_3|A_1 ∩ A_2] * ... * Pr[A_n|A_1 ∩ ... ∩ A_{n-1}].

We prove this using induction over n.

Base case: n = 1. Then Pr[A_1] = Pr[A_1] is trivially true.

Inductive hypothesis: For some n >= 1, the above equality holds.

Inductive step: Then

Pr[A_1 ∩ ... ∩ A_{n+1}]
  = Pr[(A_1 ∩ ... ∩ A_n) ∩ A_{n+1}]
  = Pr[A_{n+1}|A_1 ∩ ... ∩ A_n] * Pr[A_1 ∩ ... ∩ A_n]                           (by def. of cond. prob.)
  = Pr[A_{n+1}|A_1 ∩ ... ∩ A_n] * Pr[A_1] * Pr[A_2|A_1] * ... * Pr[A_n|A_1 ∩ ... ∩ A_{n-1}]  (by IH)

as required.

We can also define "mutual independence" for events. Events A_1, ..., A_n are mutually independent if for any i in [1, n] and any subset I ⊆ {1, ..., n}\{i} (i.e.
any subset I that does not contain i), we have Pr[A_i|∩_{j∈I} A_j] = Pr[A_i]. In other words, knowing that any combination of the other events happened gives us no information about whether or not A_i happened. Then it follows from the product rule that Pr[A_1 ∩ ... ∩ A_n] = Pr[A_1] Pr[A_2] ... Pr[A_n].

Note that mutual independence is not the same as pairwise independence. Consider a roll of a red die and a blue die. Let A be the event that the red die shows 1, B the event that the blue die shows 1, and C the event that the sum of the two dice is 7. A and B are clearly independent, and we showed last time that A and C are independent. Similarly, B and C are independent. (If you don't believe these claims, calculate each of the conditional probabilities Pr[A|B], Pr[A|C], Pr[B|C].) But it is impossible for both dice to show 1 and also sum to 7, so Pr[A ∩ B ∩ C] = 0 ≠ Pr[A] Pr[B] Pr[C] = 1/216, and the three events are not mutually independent.

Now we can go back to flipping coins. We have that E = A_1 ∩ ... ∩ A_k ∩ ¬A_{k+1} ∩ ... ∩ ¬A_n, and the A_i are mutually independent, so Pr[E] = Pr[A_1] ... Pr[A_k] Pr[¬A_{k+1}] ... Pr[¬A_n] = p * ... * p * (1-p) * ... * (1-p) = p^k (1-p)^(n-k). Since Pr[ω] = Pr[E], Pr[ω] = p^k (1-p)^(n-k).

Tree Diagrams

Suppose I have two coins in my pocket, one that is a fair coin and one that has two heads. If I take a random coin out of my pocket and flip it twice, what are the possible outcomes and their probabilities? We can draw a tree diagram to compute this. We first choose either the fair coin or the biased coin, each with probability 1/2. Let F be the event that we pick the fair coin, so Pr[F] = 1/2. Let H_1 be the event that we get heads on the first flip and H_2 the event that we get heads on the second flip. Then if we used the fair coin, we get heads with probability Pr[H_1|F] = 1/2 and tails with probability Pr[¬H_1|F] = 1/2.
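The dice example can be checked by brute force. The following sketch (not from the lecture; the names `pr`, `both`, etc. are made up for illustration) enumerates all 36 equally likely rolls and verifies that A, B, and C are pairwise independent but not mutually independent:

```python
from fractions import Fraction

# Enumerate all 36 equally likely (red, blue) rolls.
omega = [(r, b) for r in range(1, 7) for b in range(1, 7)]

def pr(event):
    """Probability of an event, given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == 1          # red die shows 1
B = lambda w: w[1] == 1          # blue die shows 1
C = lambda w: w[0] + w[1] == 7   # the two dice sum to 7

both = lambda e, f: (lambda w: e(w) and f(w))

# Pairwise independence: Pr[X ∩ Y] = Pr[X] Pr[Y] for each pair.
assert pr(both(A, B)) == pr(A) * pr(B)
assert pr(both(A, C)) == pr(A) * pr(C)
assert pr(both(B, C)) == pr(B) * pr(C)

# But not mutual independence: both dice showing 1 cannot sum to 7.
ABC = lambda w: A(w) and B(w) and C(w)
print(pr(ABC))                   # 0
print(pr(A) * pr(B) * pr(C))     # 1/216
```

Using exact `Fraction` arithmetic avoids floating-point round-off, so the independence equalities can be tested with `==`.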
Similarly, if we used the fair coin and got heads on the first flip, we get heads on the second flip with probability Pr[H_2|F ∩ H_1] = 1/2 and tails with probability Pr[¬H_2|F ∩ H_1] = 1/2. We can continue in this way until we've completed our decision tree.

   Pr[F] = 1/2    Pr[H_1|F] = 1/2    Pr[H_2|F ∩ H_1] = 1/2       1/8
  +-------------+------------------+------------------------ (f, h, h)
  |             |                  |
  |             |                  |  Pr[¬H_2|F ∩ H_1] = 1/2     1/8
  |             |                  `------------------------ (f, h, t)
  |             |
  |             |  Pr[¬H_1|F] = 1/2   Pr[H_2|F ∩ ¬H_1] = 1/2     1/8
  |             `------------------+------------------------ (f, t, h)
  |                                |
  |                                |  Pr[¬H_2|F ∩ ¬H_1] = 1/2    1/8
  |                                `------------------------ (f, t, t)
  |
  |  Pr[¬F] = 1/2   Pr[H_1|¬F] = 1    Pr[H_2|¬F ∩ H_1] = 1       1/2
  `-------------+------------------+------------------------ (b, h, h)

What is the probability of the outcome (f, h, h), in which we pick the fair coin and get two heads? This is the sole outcome in the event F ∩ H_1 ∩ H_2. We can compute its probability by multiplying the conditional probabilities along the edges from the root of the tree to the leaf corresponding to that outcome: Pr[(f, h, h)] = Pr[F ∩ H_1 ∩ H_2] = Pr[F] Pr[H_1|F] Pr[H_2|F ∩ H_1] = 1/8. Note that this follows from the product rule. We can compute the probabilities of the remaining outcomes in the same way, as given in the diagram above.

Tree diagrams are not necessarily unique; there may be more than one that adequately represents the sample space. In this example, we could have defined events HH, HT, TH, and TT, which together cover all possible results of flipping the coin twice (i.e. they partition the sample space; more on that later).
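The root-to-leaf computation above is mechanical, so it's easy to sketch in code. The following is an illustration (the `paths` table and its layout are my own, not from the lecture): each leaf's probability is the product of the conditional probabilities along its path, and the leaf probabilities sum to 1.

```python
from fractions import Fraction

half = Fraction(1, 2)
one = Fraction(1)

# Edge probabilities along each root-to-leaf path of the tree:
# (coin choice, first flip given coin, second flip given coin and first flip).
paths = {
    ("f", "h", "h"): [half, half, half],  # Pr[F], Pr[H_1|F], Pr[H_2|F ∩ H_1]
    ("f", "h", "t"): [half, half, half],
    ("f", "t", "h"): [half, half, half],
    ("f", "t", "t"): [half, half, half],
    ("b", "h", "h"): [half, one, one],    # two-headed coin always lands heads
}

# Product rule: multiply the conditional probabilities along each path.
probs = {}
for leaf, edges in paths.items():
    p = Fraction(1)
    for e in edges:
        p *= e
    probs[leaf] = p

print(probs[("f", "h", "h")])  # 1/8
print(probs[("b", "h", "h")])  # 1/2
print(sum(probs.values()))     # 1, since the leaves partition the sample space
```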
With these events, the following two-level tree also represents the sample space:

   Pr[F] = 1/2    Pr[HH|F] = 1/4     1/8
  +-------------+----------------- (f, h, h)
  |             |
  |             |  Pr[HT|F] = 1/4    1/8
  |             +----------------- (f, h, t)
  |             |
  |             |  Pr[TH|F] = 1/4    1/8
  |             +----------------- (f, t, h)
  |             |
  |             |  Pr[TT|F] = 1/4    1/8
  |             `----------------- (f, t, t)
  |
  |  Pr[¬F] = 1/2   Pr[HH|¬F] = 1    1/2
  `-------------+----------------- (b, h, h)

Probabilities of outcomes are again computed using the product rule. (We will formalize later, when we talk about conditional independence, how we came up with Pr[HH|F] = 1/4. For now, it should be obvious that the probability of getting two heads once we've chosen the fair coin is 1/4.)

Along with the coin flipping example above, this illustrates how probability models are constructed. We reduce an experiment to a sequence of simple choices and then use the product rule, computing conditional probabilities or relying on independence, to determine the probability of each outcome.

Bayes' Rule

Recall our motivating example from last time. A pharmaceutical company is marketing a new test for HIV that it claims is 99% effective, meaning that it will report positive for 99% of people who have HIV and negative for 99% of those who don't. Suppose a random person takes the test and gets a positive test result. What is the probability that the person has HIV?

Let A be the event that the person has HIV and B the event that he tests positive. We know that if he has HIV, he will test positive with probability 0.99, so Pr[B|A] = 0.99. Similarly, he tests negative with probability 0.99 if he doesn't have HIV, so Pr[¬B|¬A] = 0.99. We can also compute Pr[¬B|A] = 1 - Pr[B|A] = 0.01, and similarly, Pr[B|¬A] = 0.01. Now we want to compute Pr[A|B]. How can we do so given the information we have? We can do the following:

Pr[A|B] = Pr[A ∩ B] / Pr[B]         (by def. of cond. prob.)
        = Pr[B|A] * Pr[A] / Pr[B].  (by def. of cond. prob.)

This is called Bayes' Rule.
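As a quick sanity check, Bayes' Rule can be applied to the dice example from earlier. The helper name `bayes` below is my own, purely for illustration:

```python
from fractions import Fraction

def bayes(pr_b_given_a, pr_a, pr_b):
    """Bayes' Rule: Pr[A|B] = Pr[B|A] * Pr[A] / Pr[B]."""
    return pr_b_given_a * pr_a / pr_b

# Dice example: A = red die shows 1, C = the two dice sum to 7.
# Pr[C|A] = 1/6 (the blue die must show 6), Pr[A] = 1/6, Pr[C] = 1/6.
sixth = Fraction(1, 6)
print(bayes(sixth, sixth, sixth))  # 1/6
```

The result Pr[A|C] = 1/6 = Pr[A] agrees with the earlier claim that A and C are independent.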
A "partition" of an event B is a set of mutually disjoint events A_1, ..., A_n such that B = A_1 ∪ ... ∪ A_n. Then we get Pr[B] = Pr[A_1 ∪ ... ∪ A_n] = Pr[A_1] + ... + Pr[A_n], since A_1, ..., A_n are mutually disjoint.

Now suppose that A_1, ..., A_n partition Ω, the sample space as a whole. Then Pr[A_1 ∪ ... ∪ A_n] = Pr[A_1] + ... + Pr[A_n] = 1, which is not Pr[B] in general. How can we get an expression for Pr[B] from these events? From a Venn diagram, we can see that A_1 ∩ B, A_2 ∩ B, ..., A_n ∩ B partition B. So Pr[B] = Pr[(A_1 ∩ B) ∪ ... ∪ (A_n ∩ B)] = Pr[A_1 ∩ B] + ... + Pr[A_n ∩ B].

Finally, consider a single event A. Then the events A ∩ B and ¬A ∩ B are a partition of B. A Venn diagram shows that this is the case, but intuitively, any outcome in B is either in A and therefore in A ∩ B, or in ¬A and therefore in ¬A ∩ B. It follows that Pr[B] = Pr[A ∩ B] + Pr[¬A ∩ B]. Equivalently, by the definition of conditional probability, Pr[B] = Pr[B|A] Pr[A] + Pr[B|¬A] Pr[¬A] = Pr[B|A] Pr[A] + Pr[B|¬A] (1 - Pr[A]). Both of the above are known as the Total Probability Rule.

Combining Bayes' Rule and the Total Probability Rule, we get

Pr[A|B] = Pr[B|A] Pr[A] / (Pr[B|A] Pr[A] + Pr[B|¬A] (1 - Pr[A])).

Now we have almost everything we need, except Pr[A], the probability that a random person has HIV. This turns out to be (in the US) about 250 out of every million people, so Pr[A] = 0.00025. Plugging into the above, we get

Pr[A|B] = 0.99 * 0.00025 / (0.99 * 0.00025 + 0.01 * 0.99975) ≈ 0.024.

So the person only has a 2.4% chance of having HIV! This is much smaller than the claimed 99% accuracy. This demonstrates that Pr[A|B], which is what we care about, can be very different from Pr[B|A], which is what the manufacturer is telling us. Confusing the two is known as the "base rate fallacy."

Here's some intuition for the result. Suppose 4000 random people come in to get tested. Around 1 of the 4000 people will actually have HIV and will most likely test positive.
The other 3999 or so people won't have HIV, but around 40 of them (1%) will test positive anyway. So of the roughly 41 people who test positive, only 1 actually has HIV, and a random person who tests positive has about a 1/41 chance of having HIV.

Note, however, that this doesn't mean the test is useless. If a particular person whose specific risk factors substantially increase Pr[A] goes in to be tested, then Pr[A|B] is much higher. Suppose the person is a member of a subpopulation in which 1 in 5 people have HIV. Then

Pr[A|B] = 0.99 * 0.2 / (0.99 * 0.2 + 0.01 * 0.8) ≈ 0.96.

So if the base rate is much higher, the test is far more effective at detecting HIV. The takeaway here is that we can't ignore the base rate when evaluating the effectiveness of a test. While it doesn't make sense to blanket test the entire population, since its base rate is quite low, it does make sense to test subpopulations with much higher base rates.
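The two computations above fit one formula, so they can be sketched together. This is an illustration only; the function name `posterior` and the sensitivity/specificity terminology are my own (here both happen to equal 0.99):

```python
def posterior(sensitivity, specificity, base_rate):
    """Pr[A|B] via Bayes' Rule plus the Total Probability Rule:
    Pr[A|B] = Pr[B|A]Pr[A] / (Pr[B|A]Pr[A] + Pr[B|¬A](1 - Pr[A])).
    sensitivity = Pr[B|A], specificity = Pr[¬B|¬A], base_rate = Pr[A]."""
    true_pos = sensitivity * base_rate           # Pr[B|A] Pr[A]
    false_pos = (1 - specificity) * (1 - base_rate)  # Pr[B|¬A] (1 - Pr[A])
    return true_pos / (true_pos + false_pos)

# General US population: base rate 250 per million.
print(round(posterior(0.99, 0.99, 0.00025), 3))  # 0.024

# Higher-risk subpopulation: base rate 1 in 5.
print(round(posterior(0.99, 0.99, 0.2), 3))      # 0.961
```

Varying `base_rate` while holding the test fixed makes the base rate fallacy concrete: the same 99%-effective test yields a 2.4% posterior in one population and a 96% posterior in another.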