Administrative info
  MT2 tomorrow
    Same location and policies as MT1
    Cover through polling/LLN (Wednesday)
  Review session
    Today 6:30-8 in 320 Soda
  Exams can be picked up from Soda front office
  PA3 due Thursday

Review
  Last time, we defined the joint distribution of two random variables
  X, Y as the set of probabilities
    Pr[X = a, Y = b]
  for all possible values of a and b.

  Then the set of vanilla probabilities
    Pr[X = a]
  is the marginal distribution of X.

  The set of conditional probabilities
    Pr[X = a | E]
  given an event E is the conditional distribution of X given E.

  Then we defined conditional expectation given E:
    E(X | E) = ∑_{a ∈ A} a * Pr[X = a | E],
  where A is the set of all possible values that X can take on.

  The total expectation law is the expectation analogue of the total
  probability rule. Given a random variable Y (or any partition of the
  sample space), we have:
    E(X) = ∑_{b∈B} Pr[Y=b] E(X|Y=b),
  where B is the set of all possible values that Y can take on.

  We then defined conditional independence for events A and B. A and B
  are independent conditional on C if
    Pr[A, B | C] = Pr[A|C] Pr[B|C].
  Equivalently, A and B are independent given C if
    Pr[A | B, C] = Pr[A|C].
  This tells us that if we are given C, then knowing B occurred gives
  us no information about whether or not A occurred.

  Note that Pr[A|B|C] is meaningless, since what is to the right of
  the bar determines our sample space. So in order to condition on two
  events B and C, we condition on their intersection.

  We can similarly define conditional independence for random
  variables.

  There are conditional versions of other probability rules as well.
  (1) inclusion/exclusion
      Pr[A∪B|C] = Pr[A|C] + Pr[B|C] - Pr[A,B|C]
  (2) total probability rule
      Pr[A|C] = Pr[A,B|C] + Pr[A,B̄|C]
        = Pr[A|B,C] Pr[B|C] + Pr[A|B̄,C] Pr[B̄|C]
  (3) Bayes' rule
      Pr[A|B,C] = Pr[B|A,C] Pr[A|C] / Pr[B|C]
  As an exercise, you may wish to prove some of these on your own.
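  Rather than proving them by hand, we can at least sanity-check all
  three conditional rules numerically. Here is a sketch in Python using
  exact arithmetic; the sample space (two fair dice) and the events A,
  B, C are made up purely for illustration:

```python
from fractions import Fraction

# Uniform sample space: two fair dice (a made-up example to check the rules).
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def pr(event):
    # Pr[event] under the uniform distribution on omega.
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def pr_given(event, cond):
    # Pr[event | cond] = Pr[event AND cond] / Pr[cond].
    return pr(lambda w: event(w) and cond(w)) / pr(cond)

A = lambda w: w[0] >= 4          # first die is at least 4
B = lambda w: w[1] % 2 == 0      # second die is even
C = lambda w: w[0] + w[1] >= 6   # sum is at least 6

AB = lambda w: A(w) and B(w)
BC = lambda w: B(w) and C(w)
AC = lambda w: A(w) and C(w)
notB = lambda w: not B(w)
notB_C = lambda w: notB(w) and C(w)

# (1) conditional inclusion/exclusion
assert pr_given(lambda w: A(w) or B(w), C) == \
    pr_given(A, C) + pr_given(B, C) - pr_given(AB, C)

# (2) conditional total probability rule (partition on B and its complement)
assert pr_given(A, C) == \
    pr_given(A, BC) * pr_given(B, C) + pr_given(A, notB_C) * pr_given(notB, C)

# (3) conditional Bayes' rule
assert pr_given(A, BC) == pr_given(B, AC) * pr_given(A, C) / pr_given(B, C)
```

  Of course, checking one example is not a proof; the proofs follow by
  expanding each conditional probability as a ratio.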

  We now return to Bayesian inference, using the new probability tools
  we have learned.

Inference
  Suppose I have a friend who challenges me to a game of flipping
  coins, where I win $1 if I correctly guess the outcome but lose $1
  if I don't. I know that my friend has been to a website that sells n
  different trick coins, each with a different probability p_i of
  heads, but I'm not sure which of the coins he purchased. What should
  I guess for the first flip? What should I guess for the second flip
  if the first one comes up heads? The third flip if the first two are
  both heads?

  Let X be a random variable that is i if the coin that my friend is
  using is the ith coin on the website. Since I have no idea which
  coin he is using, the "prior" distribution of X is
    Pr[X = i] = 1/n, i ∈ {1,2,...,n}.
  Our goal is to refine these probabilities based on our observations.
  This is the problem of "inference," where we attempt to determine a
  hidden quantity by making observations of events that are related to
  the hidden quantity.

  Let Y_j be a random variable that is H if the jth flip is heads, T
  otherwise. These are the observables, and we have
    Pr[Y_j = H | X = i] = p_i.
  We also note that the Y_j are conditionally independent given X,
  i.e. if we know that coin i is being used, then every flip has
  probability p_i of heads, independent of all other flips.

  Let's compute the probability that the first flip is heads, i.e.
    Pr[Y_1 = H].
  By the total probability rule,
    Pr[Y_1 = H] = ∑_{i=1}^n Pr[Y_1 = H | X = i] Pr[X = i]
      = ∑_{i=1}^n p_i/n
      = 1/n ∑_{i=1}^n p_i.

  This answers our first question: if
    Pr[Y_1 = H] = 1/n ∑_{i=1}^n p_i > 1/2,
  I should bet on heads, otherwise tails.
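  This first computation is just an average of the p_i. As a sketch in
  Python (the inventory of three coins here is hypothetical, chosen only
  to have something concrete to average):

```python
from fractions import Fraction

# Hypothetical website inventory: n = 3 trick coins with these heads
# probabilities. The real inventory is unknown; these values are made up.
p = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
n = len(p)

# With a uniform prior over the coins, Pr[Y_1 = H] is the average of the p_i.
pr_h1 = sum(p) / n

# Decision rule for the first flip: bet heads iff Pr[Y_1 = H] > 1/2.
guess = "heads" if pr_h1 > Fraction(1, 2) else "tails"
```

  For this particular inventory the average is exactly 1/2, so the first
  bet is a toss-up.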

  Now if I see heads on the first flip, that changes the probabilities
  of each of the coins. In particular, if one of the coins has tails
  on both sides, then I know that my friend is not using that coin.
  More generally, we want the "posterior" distribution of X given the
  observation that the first flip is heads:
    Pr[X = i | Y_1 = H].
  By Bayes' rule, we have
    Pr[X = i | Y_1 = H] = Pr[Y_1 = H | X = i] Pr[X = i] / Pr[Y_1 = H]
      = (p_i 1/n) / (1/n ∑_{k=1}^n p_k)
      = p_i / (∑_{k=1}^n p_k).

  Now in order to answer our second question, we need to compute the
  probability of getting heads on the second flip, given that the
  first flip was heads:
    Pr[Y_2 = H | Y_1 = H].
  We follow the same procedure as in computing Pr[Y_1 = H], except
  that we use the conditional version of the total probability rule
  given Y_1 = H:
    Pr[Y_2 = H | Y_1 = H]
      = ∑_{i=1}^n Pr[Y_2 = H | Y_1 = H, X = i] Pr[X = i | Y_1 = H].
  Here, we note that as mentioned before, Y_1 and Y_2 are conditionally
  independent given X, so
    Pr[Y_2 = H | Y_1 = H, X = i]
      = Pr[Y_2 = H | X = i]
      = p_i.
  Plugging this in, we get
    Pr[Y_2 = H | Y_1 = H]
      = ∑_{i=1}^n (p_i p_i / ∑_{k=1}^n p_k)
      = (∑_{i=1}^n p_i^2) / (∑_{i=1}^n p_i).

  Now if this quantity is greater than 1/2, I should bet on heads.
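  In code, both the posterior and this predictive probability are short
  one-liners. A sketch in Python, again using a hypothetical inventory
  of three coins with p = (1/4, 1/2, 3/4) and a uniform prior:

```python
from fractions import Fraction

# Hypothetical inventory; uniform prior over the coins.
p = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]

# Posterior after seeing heads once: Pr[X = i | Y_1 = H] = p_i / sum(p).
posterior = [pi / sum(p) for pi in p]

# Predictive probability of heads on the next flip:
#   Pr[Y_2 = H | Y_1 = H] = sum(p_i^2) / sum(p_i)
pr_h2_given_h1 = sum(pi * pi for pi in p) / sum(p)
```

  Note that even though the first bet was a toss-up for this inventory,
  after one heads the predictive probability rises above 1/2, so the
  second bet should be heads.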

  We can continue this procedure as we see more observations and
  learn more information about which coin is being used. First, we
  use the conditional version of Bayes' rule to compute
    Pr[X = i | Y_1 = H, Y_2 = H]
      = Pr[Y_2 = H | X = i, Y_1 = H] Pr[X = i | Y_1 = H] /
        Pr[Y_2 = H | Y_1 = H]
      = Pr[Y_2 = H | X = i] Pr[X = i | Y_1 = H] /
        Pr[Y_2 = H | Y_1 = H]              (conditional independence)
      = [p_i p_i / (∑_{k=1}^n p_k)] /
        [(∑_{k=1}^n p_k^2) / (∑_{k=1}^n p_k)]
      = p_i^2 / (∑_{k=1}^n p_k^2).
  We can proceed to compute
    Pr[Y_3 = H | Y_1 = H, Y_2 = H]
      = (∑_{i=1}^n p_i^3) / (∑_{i=1}^n p_i^2),
  which would tell us what to bet on in the third flip.
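  The pattern in these computations is that after m consecutive heads,
  the posterior weight on coin i is proportional to p_i^m, so the
  predictive probability of heads collapses to ∑ p_i^(m+1) / ∑ p_i^m. A
  sketch in Python (the three-coin inventory is hypothetical):

```python
from fractions import Fraction

def pr_next_heads(p, m):
    """Pr[next flip is H | first m flips were all H], uniform prior over coins.

    The posterior after m heads is proportional to p_i^m, so the predictive
    probability is sum(p_i^(m+1)) / sum(p_i^m). m = 0 gives Pr[Y_1 = H].
    """
    return sum(pi ** (m + 1) for pi in p) / sum(pi ** m for pi in p)

# Hypothetical inventory of three coins.
p = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
```

  With this inventory, m = 0, 1, 2 give 1/2, 7/12, and 9/14: each run of
  heads pushes the prediction further toward the most heads-biased coin.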

Iterative Update
  As you can see, the above procedure requires a lot of work each time
  we make an observation. We can, however, save ourselves much of this
  work by computing an update rule that, given any prior distribution,
  gives us the posterior distribution after making an observation.

  This can be done in the above general case, but for a simpler
  example, suppose I've checked my friend's browser history and have
  learned that he purchased two trick coins, the first with p_1 = 3/4
  and the second with p_2 = 1/2 (not a trick coin, but a trick trick
  coin!). Then let's start with an arbitrary prior distribution
    Pr[X = 1] = q
    Pr[X = 2] = 1 - q.

  Now we compute the probability that the first flip is heads:
    Pr[Y_1 = H] = Pr[Y_1 = H | X = 1] Pr[X = 1] +
                  Pr[Y_1 = H | X = 2] Pr[X = 2]
      = 3/4 q + 1/2 (1 - q)
      = 1/4 q + 1/2.
  Then if the first flip is heads, we compute the posterior
  distribution:
    Pr[X = 1 | Y_1 = H] = Pr[Y_1 = H | X = 1] Pr[X = 1] / Pr[Y_1 = H]
      = 3/4 q / (1/4 q + 1/2)
      = 3q / (q + 2)
    Pr[X = 2 | Y_1 = H] = Pr[Y_1 = H | X = 2] Pr[X = 2] / Pr[Y_1 = H]
      = (1/2)(1 - q) / (1/4 q + 1/2)
      = (2 - 2q) / (q + 2).

  This gives us the update rule for the distribution of X when we see
  heads:
    (q, 1-q) -> (3q/(q+2), (2-2q)/(q+2)).

  We can similarly compute an update rule for when we see tails.

  As a concrete example, suppose we start off with q = 1/2, so the
  prior distribution is
    (1/2, 1/2).
  Then if the first flip is heads, the posterior distribution is
    (1/2, 1/2) -> ((3/2)/(5/2), 1/(5/2)) = (3/5, 2/5).
  Then if the second flip is heads, the new posterior distribution is
    (3/5, 2/5) -> ((9/5)/(13/5), (4/5)/(13/5)) = (9/13, 4/13).
  We can continue this process as we make more observations.
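  The update rule is a few lines of code. Here is a sketch in Python of
  one step of recursive Bayesian estimation for the two-coin example
  (p_1 = 3/4, p_2 = 1/2), reproducing the numbers above:

```python
from fractions import Fraction

# Two-coin example from the text: p1 = 3/4, p2 = 1/2.
P = {1: Fraction(3, 4), 2: Fraction(1, 2)}

def update(dist, flip):
    """One step of recursive Bayesian estimation: posterior after one flip."""
    # Likelihood of the observation under each coin.
    like = {i: (P[i] if flip == "H" else 1 - P[i]) for i in dist}
    # Normalizer is the total probability of the observation.
    norm = sum(like[i] * dist[i] for i in dist)
    return {i: like[i] * dist[i] / norm for i in dist}

dist = {1: Fraction(1, 2), 2: Fraction(1, 2)}   # prior (1/2, 1/2)
dist = update(dist, "H")                        # -> (3/5, 2/5)
dist = update(dist, "H")                        # -> (9/13, 4/13)
```

  Each observation costs O(n) work, and we never have to revisit old
  observations: the current distribution summarizes everything seen so
  far.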

  This procedure is called "recursive Bayesian estimation" and
  provides the ability to incrementally update the distribution as
  each observation is made.

Likelihood Ratios
  Instead of using probabilities, gamblers use odds ratios. For
  example, rather than expressing the probability of getting a four of
  a kind in a five card poker hand as 1/4165, they would say that the
  odds in favor of getting a four of a kind are 1:4164 or 1/4164,
  meaning that it is 1/4164 times as likely to get a four of a kind as
  it is to not get one. Similarly, for a roulette wheel, rather than
  saying that the probability of black is 9/19, they would say that
  the odds in favor of black are 9:10 or 9/10.

  Of course, we can get from probabilities to odds and back quite
  easily. If the probability of an event is p, then the odds in favor
  are
    p / (1-p).
  Similarly, if the odds of an event are a/b, then
    p = a / (a + b).
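  Both conversions are one-liners. A quick sketch in Python, checked
  against the poker and roulette numbers above:

```python
from fractions import Fraction

def prob_to_odds(p):
    # Odds in favor of an event with probability p are p : (1 - p).
    return p / (1 - p)

def odds_to_prob(a, b):
    # Odds a : b correspond to probability a / (a + b).
    return Fraction(a, a + b)

# Roulette: Pr[black] = 9/19 gives odds 9:10.
assert prob_to_odds(Fraction(9, 19)) == Fraction(9, 10)
# Poker: odds 1:4164 for four of a kind give probability 1/4165.
assert odds_to_prob(1, 4164) == Fraction(1, 4165)
```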

  Odds, or "likelihood ratios," can occasionally be easier to work
  with. In the coin flipping example above, if we only care about
  which of the coins is most likely (to determine whether or not I
  should accuse my friend of cheating), then we can save ourselves
  some work by using likelihood ratios.

  Consider the example where my friend has one of two coins, one with
  p_1 = 3/4 and the other p_2 = 1/2. For an arbitrary prior
  distribution (q, 1 - q), the prior likelihood ratio is
    Pr[X = 1] / Pr[X = 2] = q / (1 - q).
  Then if we see heads, we want to compute a new likelihood ratio
    Pr[X = 1 | Y_1 = H] / Pr[X = 2 | Y_1 = H]
      = (Pr[Y_1 = H | X = 1] Pr[X = 1]) /
        (Pr[Y_1 = H | X = 2] Pr[X = 2]).
  Notice that we no longer have to compute Pr[Y_1 = H]. So we get
    Pr[X = 1 | Y_1 = H] / Pr[X = 2 | Y_1 = H]
      = (3/4 q) / (1/2 (1 - q))
      = (3/2) · q / (1 - q).
  Thus, in order to compute a new likelihood ratio, we merely have to
  multiply the old one by 3/2.

  To repeat the example above, we start with a likelihood ratio of 1.
  Then if we see heads, we get 3/2. Then if we see another, we get
  9/4. And if we see another, we get 27/8. And if we see one more, we
  get 81/16, and it's a good bet that my friend is cheating.

  What if we see tails? We get
    Pr[X = 1 | Y_1 = T] / Pr[X = 2 | Y_1 = T]
      = (Pr[Y_1 = T | X = 1] Pr[X = 1]) /
        (Pr[Y_1 = T | X = 2] Pr[X = 2])
      = (1/4 q) / (1/2 (1 - q))
      = (1/2) · q / (1 - q).
  So we update our likelihood ratio by multiplying by 1/2.
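  So the whole inference reduces to one multiplication per observation:
  by 3/2 on heads, by 1/2 on tails. A sketch in Python reproducing the
  four-heads run above:

```python
from fractions import Fraction

# Likelihood-ratio updates for the two-coin example (p1 = 3/4, p2 = 1/2):
# multiply by (3/4)/(1/2) = 3/2 on heads, by (1/4)/(1/2) = 1/2 on tails.
HEADS_FACTOR = Fraction(3, 2)
TAILS_FACTOR = Fraction(1, 2)

ratio = Fraction(1)            # uniform prior q = 1/2 gives ratio 1
for flip in "HHHH":
    ratio *= HEADS_FACTOR if flip == "H" else TAILS_FACTOR
# After four heads the ratio is 81/16: strong evidence for the biased coin.

# If we ever do want the probability back, odds r convert via r / (1 + r).
posterior_coin1 = ratio / (1 + ratio)
```

  The posterior probability of coin 1 after four heads works out to
  81/97, consistent with running the full recursive Bayesian update.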