Administrative info

PA2 due today. HW5 due tomorrow.

Review

Recall that the conditional probability of event B given event A is Pr[B|A] = Pr[A ∩ B]/Pr[A]. Also recall that events A and B are independent if Pr[B|A] = Pr[B], or equivalently Pr[A ∩ B] = Pr[A] Pr[B]. Further recall the general product rule:

Pr[A_1 ∩ ... ∩ A_n] = Pr[A_1] * Pr[A_2|A_1] * Pr[A_3|A_1 ∩ A_2] * ... * Pr[A_n | A_1 ∩ ... ∩ A_{n-1}].

Events A_1, ..., A_n are mutually independent if for any i in [1, n] and any subset I ⊆ {1, ..., n}\{i} (i.e. any subset I that does not contain i), we have Pr[A_i | ∩_{j∈I} A_j] = Pr[A_i]. Then it follows from the product rule that Pr[A_1 ∩ ... ∩ A_n] = Pr[A_1] Pr[A_2] ... Pr[A_n].

Recall Bayes' Rule: Pr[A|B] = Pr[B|A] * Pr[A] / Pr[B].

Recall the variations of the Total Probability Rule (writing ¬A for the complement of A):

Pr[B] = Pr[A ∩ B] + Pr[¬A ∩ B]
      = Pr[B|A] Pr[A] + Pr[B|¬A] Pr[¬A]
      = Pr[B|A] Pr[A] + Pr[B|¬A] (1 - Pr[A]).

Combining Bayes' Rule and the Total Probability Rule, we get

Pr[A|B] = Pr[B|A] Pr[A] / (Pr[B|A] Pr[A] + Pr[B|¬A] (1 - Pr[A])).

Base Rates

Recall the HIV test from last time. We defined A to be the event that a random person has HIV, and B to be the event that he tests positive. We computed Pr[B|A] = 0.99 (a true positive), Pr[¬B|¬A] = 0.99 (a true negative), and Pr[B|¬A] = 0.01 (a false positive). We then computed that if we have a base rate of Pr[A] = 0.00025 in the entire population, then

Pr[A|B] = 0.99 * 0.00025 / (0.99 * 0.00025 + 0.01 * 0.99975) ≈ 0.024.

This tells us that blanket testing the entire population is not a good idea, since the test will produce far more false positives than actual positives. What if we only tested a subpopulation with a higher risk factor for HIV, say one in which 1 in 5 people are infected? That changes the base rate to Pr[A] = 0.2, and we get

Pr[A|B] = 0.99 * 0.2 / (0.99 * 0.2 + 0.01 * 0.8) ≈ 0.96.

So if the base rate is much higher, the test is far more effective at detecting HIV. And if you have a high risk factor, this is a test you'd want to take.
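This base-rate effect is easy to check numerically. Here is a small Python sketch of the combined Bayes/Total Probability formula (the function name posterior is just for illustration):

```python
# Pr[A|B] = Pr[B|A] Pr[A] / (Pr[B|A] Pr[A] + Pr[B|~A] (1 - Pr[A]))
def posterior(p_b_given_a, p_b_given_not_a, p_a):
    num = p_b_given_a * p_a
    return num / (num + p_b_given_not_a * (1 - p_a))

# Blanket testing: base rate 0.00025.
print(posterior(0.99, 0.01, 0.00025))  # ≈ 0.024
# High-risk subpopulation: base rate 0.2.
print(posterior(0.99, 0.01, 0.2))      # ≈ 0.96
```

Plugging in different base rates with the same test accuracy makes the contrast above immediate.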
The takeaway here is that we can't ignore the base rate when evaluating the effectiveness of a test. While it doesn't make sense to blanket test the entire population, since its base rate is quite low, it does make sense to test subpopulations with much higher base rates.

Inclusion/Exclusion

Recall the inclusion/exclusion principle for events A and B: Pr[A ∪ B] = Pr[A] + Pr[B] - Pr[A ∩ B]. We count outcomes in A and in B, but that double counts outcomes in both, so we adjust by subtracting them off. What if we have three events? We get

Pr[A ∪ B ∪ C] = Pr[A] + Pr[B] + Pr[C]
              - Pr[A ∩ B] - Pr[A ∩ C] - Pr[B ∩ C]
              + Pr[A ∩ B ∩ C].

By counting outcomes in A, B, and C, we double count those that appear in any pair of A, B, C, so we subtract those off. However, if an outcome appears in all three of A, B, C, then we've added three copies in the first line and subtracted three copies in the second line, so we have to add one copy in the third line to include those outcomes. This generalizes to larger numbers of events, with alternating additions and subtractions. (Can you see why it is called inclusion/exclusion?) See the reader for the general formula.

EX: Recall the dice game from before. You pick a number from 1 to 6. The casino rolls three dice, and if your number comes up, you win. What is your probability of winning?

ANS: Let A be the event that your number comes up on the first die, B on the second, and C on the third. Then you win for outcomes that are in A ∪ B ∪ C. So by inclusion/exclusion,

Pr[A ∪ B ∪ C] = Pr[A] + Pr[B] + Pr[C] - Pr[A ∩ B] - Pr[A ∩ C] - Pr[B ∩ C] + Pr[A ∩ B ∩ C].

What is Pr[A]? Well, the probability that the first die has your number is 1/6, so Pr[A] = 1/6, and similarly, Pr[B] = Pr[C] = 1/6. What is Pr[A ∩ B]? The results on different dice are independent, so Pr[A ∩ B] = Pr[A] Pr[B] = 1/36, and similarly for Pr[A ∩ C] and Pr[B ∩ C]. By a similar argument, Pr[A ∩ B ∩ C] = 1/216.
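As a sanity check on these component probabilities (and on the final answer), we can enumerate all 6^3 = 216 equally likely outcomes; a Python sketch:

```python
from fractions import Fraction
from itertools import product

target = 6  # your chosen number; by symmetry any choice of 1..6 works
rolls = list(product(range(1, 7), repeat=3))  # all 216 outcomes

# Component probabilities from the text.
pr_a = Fraction(sum(r[0] == target for r in rolls), len(rolls))
pr_ab = Fraction(sum(r[0] == target and r[1] == target for r in rolls),
                 len(rolls))
# The winning probability Pr[A ∪ B ∪ C]: target appears on some die.
pr_win = Fraction(sum(target in r for r in rolls), len(rolls))

print(pr_a, pr_ab, pr_win)  # 1/6 1/36 91/216
```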
Then

Pr[A ∪ B ∪ C] = 1/6 + 1/6 + 1/6 - 1/36 - 1/36 - 1/36 + 1/216
              = 1/2 - 1/12 + 1/216
              = 108/216 - 18/216 + 1/216
              = 91/216 ≈ 0.42.

This is the same answer as before, but it took a lot more work to get it.

Union Bound

From our reasoning for the inclusion/exclusion principle, we see that Pr[A_1] + ... + Pr[A_n] overstates the probability of Pr[A_1 ∪ ... ∪ A_n]. We can formalize this as the union bound:

Pr[A_1 ∪ ... ∪ A_n] <= Pr[A_1] + ... + Pr[A_n].

EX: Suppose for MT2, to prevent students from cheating, we place on each desk in the lecture hall a random number from 1 to 1000. We give one question that is parameterized by that number. If two people sitting next to each other have the same number, then they can copy off each other. What is the probability that any of the 62 students will cheat?

ANS: Computing this exactly seems hard, so let's just compute an upper bound. There are at most 61 pairs of students sitting next to each other (think of them all sitting in one long row). Let A_i be the event that the ith pair has the same number. Then Pr[A_i] = 1/1000. Let B be the event that some pair has the same number, B = A_1 ∪ ... ∪ A_61. Then Pr[B] <= Pr[A_1] + ... + Pr[A_61] = 61/1000. So the probability of any pair sharing the same number is at most 6.1%.

Hashing

Now that we've seen many techniques for computing probabilities, let us apply them to two problems of interest: hashing and coupon collecting. Recall the birthday paradox. We computed the probability that two people share the same birthday given 365 days and m people. We found that when m = 23, we have a slightly higher than even chance of two people sharing a birthday. Last week was Neptune's birthday! It was exactly one Neptunian year after it was first discovered in 1846. A year on Neptune is 89,666 Neptunian days. Now how many Neptunians do we need so that we have a better than even chance of two of them sharing the same birthday?
Let's redo the analysis in the general case, where we have n days and m individuals. How many sample points are there? There are |Ω| = n^m, since each individual has n days to choose from and there are m individuals. Each of these is assumed to be equally likely. Now let E be the event that no two individuals share the same birthday. How many outcomes are in E? Well, the first person has n choices of days, the second person has n-1 choices that are different from the first, the third person has n-2 choices that are different from the first two, and so on, until the mth person has n-(m-1) = n-m+1 choices. Thus, |E| = n * (n-1) * ... * (n-m+1), and

Pr[E] = |E|/|Ω| = n * (n-1) * ... * (n-m+1) / n^m = n/n * (n-1)/n * ... * (n-m+1)/n.

We can compute Pr[E] another way using the product rule. Let E_i be the event that the ith person's birthday is different from those of persons 1, ..., i-1. Then

Pr[E] = Pr[E_1 ∩ E_2 ∩ ... ∩ E_m] = Pr[E_1] * Pr[E_2|E_1] * Pr[E_3|E_1 ∩ E_2] * ... * Pr[E_m|E_1 ∩ E_2 ∩ ... ∩ E_{m-1}].

Now we need to compute the probability Pr[E_i|E_1 ∩ ... ∩ E_{i-1}], the probability that the ith person's birthday is not the same as those of persons 1, ..., i-1 given that all those people have different birthdays. The ith person is left with n-(i-1) = n-i+1 choices of distinct days out of n days total, so Pr[E_i|E_1 ∩ ... ∩ E_{i-1}] = (n-i+1)/n. Plugging into the product rule, we get

Pr[E] = (n-1+1)/n * (n-2+1)/n * ... * (n-m+1)/n = n/n * (n-1)/n * ... * (n-m+1)/n,

as before. Let us rewrite each factor (n-i)/n as (1 - i/n) to get

Pr[E] = 1 * (1 - 1/n) * (1 - 2/n) * ... * (1 - (m-1)/n).

Before we continue, let's look at the Taylor series for e^{-x}:

e^{-x} = 1 - x + x^2/2! - x^3/3! + ...

If x is small, then x^2/2! is really small, x^3/3! is ridiculously small, x^4/4! is ludicrously small, and so on. So we get e^{-x} >= 1 - x, and if x is small, then the two sides are very nearly equal. Using this approximation, we get Pr[E] = (1 - 1/n) * (1 - 2/n) * ...
* (1 - (m-1)/n) <= e^{-1/n} * e^{-2/n} * ... * e^{-(m-1)/n} = exp(-(1/n + 2/n + ... + (m-1)/n)) = exp(-(1 + 2 + ... + (m-1))/n) = exp(-(m-1)m/2n) ≈ exp(-m^2/2n).

Suppose we want to know when this probability is about 1/2. Then

Pr[E] ≈ exp(-m^2/2n) ≈ 1/2
-m^2/2n ≈ -ln(2)
m^2 ≈ 2n ln(2)
m ≈ sqrt(2n ln(2)) ≈ 1.18 sqrt(n).

So when we have 1.18 sqrt(n) individuals, we have about an even chance that two individuals share the same birthday. In the case of Neptune, we plug in n = 89666 to get m ≈ 1.18 sqrt(89666) ≈ 353. So we only need 353 Neptunians to make it likely that two of them share a birthday! This should make intuitive sense. When we have m people, there are C(m, 2) ≈ m^2/2 pairs of people, each pair of which has a 1/n chance of yielding a common birthday.

What does this have to do with hashing? A hash table is a data structure for storing items. If it has n locations, then we use a hash function h(x) to map an item x to a location 0 <= h(x) < n. At each location, there is a linked list that stores all items that are mapped to that location. The longer the list, the slower basic operations on the hash table will be. Ideally, we want no two items to be mapped to the same location, i.e. no "collisions." Then the operations will take constant time. Suppose we store m items into the hash table. How large can m be so that the probability of a collision is less than 1/2? Before we calculate, let's outline some assumptions we are making: (1) For each item x, h(x) is uniformly random over [0, n-1], i.e. all n locations are equally likely. (2) The hash values for each item are mutually independent. Then this is just the birthday paradox! The n locations are our n days, and the m items are our m individuals, so we get m ≈ 1.18 sqrt(n).

Another way to express this problem is in terms of balls and bins, where each location is a bin and each item is a ball. Then we are randomly throwing balls into bins. This abstraction is very useful in Computer Science.
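Both the exact product for Pr[E] and the exp(-m^2/2n) approximation are easy to compute directly; a Python sketch (the function names are just for illustration):

```python
import math

def p_no_collision(n, m):
    """Exact Pr[E]: the product of (1 - i/n) for i = 0, ..., m-1."""
    p = 1.0
    for i in range(m):
        p *= 1 - i / n
    return p

def p_approx(n, m):
    """The exp(-m^2/2n) approximation derived above."""
    return math.exp(-m * m / (2 * n))

print(p_no_collision(365, 23))     # ≈ 0.493: the classic birthday paradox
print(p_no_collision(89666, 353))  # ≈ 0.50: 353 Neptunians suffice
print(p_approx(89666, 353))        # ≈ 0.50: the approximation agrees
```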
Finally, note that we made some approximations in the above analysis. In the reader, you can see a table that demonstrates that these approximations are very good even for small n.

Coupon Collector's Problem

Let's analyze a somewhat different problem. Suppose a local cereal manufacturer places a baseball card with a random Giants player in each box of cereal. There are n players who appear on a card, and each box contains a card chosen uniformly at random and independently from all other boxes. Now I am a big fan of the Kung Fu Panda, i.e. Pablo Sandoval. I really want his baseball card. How many boxes of cereal do I have to buy to have a better than even chance of getting his card?

Suppose I buy m boxes of cereal. Let E be the event that I don't get a Panda card, and E_i be the event that the ith box doesn't have his card. What is Pr[E_i]? Well, there are n cards, and n-1 don't have the Panda, so Pr[E_i] = (n-1)/n = (1 - 1/n). Then

Pr[E] = Pr[E_1 ∩ ... ∩ E_m]
      = Pr[E_1] ... Pr[E_m]    (mutual independence)
      = (1 - 1/n)^m.

Using the Taylor expansion from before, Pr[E] <= (exp(-1/n))^m = exp(-m/n). Setting this equal to 1/2 for an even chance of getting a Panda card, we get

1/2 = exp(-m/n)
-ln(2) = -m/n
m = n ln(2) ≈ 0.69n.

So if I buy 0.69n boxes, I have about an even chance of getting the Panda.

Suppose I want all n players. (I like The Beard (Brian Wilson), Buster Posey, and the rest of the Giants as well.) Now how many boxes do I have to buy to have an even chance of getting all the players? Let F_j be the event that I don't get the jth player, and F be the event that I am missing some player. Then F = F_1 ∪ ... ∪ F_n. Note that the F_j are not independent! Knowing that I didn't get a Panda card makes it more likely that I got someone else's. We already saw that Pr[F_j] <= exp(-m/n). Then we have

Pr[F] = Pr[F_1 ∪ ... ∪ F_n]
     <= Pr[F_1] + ... + Pr[F_n]    (union bound)
     <= n exp(-m/n).
Setting this to 1/2, we get

1/2 = n exp(-m/n)
1/(2n) = exp(-m/n)
-ln(2n) = -m/n
m = n ln(2n).

So m = n ln(2n) boxes are sufficient to guarantee at least an even chance of getting all the players. As you can see, we need many boxes to make it likely that we find the player we like or assemble a full collection of all players. So this is a great marketing ploy for the cereal manufacturer.

Why did we do these examples above? They illustrate how the probability techniques we learned can be applied to solve real-world problems.
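Since the union bound only overestimates Pr[F], buying m = n ln(2n) boxes should complete the collection at least half the time (in practice somewhat more often, because the bound is loose). A quick Monte Carlo sketch, where the roster size n = 30 is a made-up value for illustration:

```python
import math
import random

def boxes_until_complete(n, rng):
    """Buy boxes one at a time until all n distinct cards are seen."""
    seen, boxes = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # each box has a uniform random card
        boxes += 1
    return boxes

n = 30                              # hypothetical number of players
m = math.ceil(n * math.log(2 * n))  # the bound from the text: n ln(2n)
rng = random.Random(0)
trials = 2000
wins = sum(boxes_until_complete(n, rng) <= m for _ in range(trials))
print(wins / trials)                # comfortably above 1/2
```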