Administrative info
MT2 tomorrow
Same location and policies as MT1
Cover through polling/LLN (Wednesday)
Review session
Today 6:30-8 in 320 Soda
Exams can be picked up from the Soda front office
PA3 due Thursday
Review
Last time, we defined the joint distribution of two random variables
X, Y as the set of probabilities
Pr[X = a, Y = b]
for all possible values of a and b.
Then the set of vanilla probabilities
Pr[X = a]
is the marginal distribution of X.
The set of conditional probabilities
Pr[X = a | E]
given an event E is the conditional distribution of X given E.
Then we defined conditional expectation given E:
E(X | E) = ∑_{a ∈ A} a * Pr[X = a | E],
where A is the set of all possible values that X can take on.
The total expectation law is the expectation analogue of the total
probability rule. Given a random variable Y (or any partition of the
sample space), we have:
E(X) = ∑_{b∈B} Pr[Y=b] E(X|Y=b),
where B is the set of all possible values that Y can take on.
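As a quick sanity check of the total expectation law, here is a small
Python sketch; the joint distribution below is made up purely for
illustration and is not from the lecture.

    # Hypothetical joint distribution Pr[X = a, Y = b], invented for illustration.
    joint = {(0, 'b1'): 0.1, (1, 'b1'): 0.2, (0, 'b2'): 0.3, (1, 'b2'): 0.4}

    # Direct computation: E(X) = sum over a of a * Pr[X = a].
    E_direct = sum(a * p for (a, _), p in joint.items())

    # Total expectation law: E(X) = sum over b of Pr[Y = b] * E(X | Y = b).
    E_total = 0.0
    for b in {b for (_, b) in joint}:
        pr_y = sum(p for (_, bb), p in joint.items() if bb == b)
        e_cond = sum(a * p / pr_y for (a, bb), p in joint.items() if bb == b)
        E_total += pr_y * e_cond

    print(E_direct, E_total)   # both are 0.6, up to floating-point rounding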
We then defined conditional independence for events A and B. A and B
are independent conditional on C if
Pr[A, B | C] = Pr[A|C] Pr[B|C].
Equivalently, A and B are independent given C if
Pr[A | B, C] = Pr[A|C].
This tells us that if we are given C, then knowing B occurred gives
us no information about whether or not A occurred.
Note that Pr[A|B|C] is meaningless, since what is to the right of
the bar determines our sample space. So in order to condition on two
events B and C, we condition on their intersection.
We can similarly define conditional independence for random
variables.
There are conditional versions of other probability rules as well.
(1) inclusion/exclusion
Pr[A∪B|C] = Pr[A|C] + Pr[B|C] - Pr[A,B|C]
(2) total probability rule
Pr[A|C] = Pr[A,B|C] + Pr[A,B̄|C]
= Pr[A|B,C] Pr[B|C] + Pr[A|B̄,C] Pr[B̄|C]
(3) Bayes' rule
Pr[A|B,C] = Pr[B|A,C] Pr[A|C] / Pr[B|C]
As an exercise, you may wish to prove some of these on your own.
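For instance, a small Python sketch can verify rules (2) and (3)
numerically; the sample space of three fair coin flips and the
particular events A, B, C below are just illustrative choices.

    import itertools

    # Illustrative sample space: three fair coin flips, all 8 outcomes equally likely.
    outcomes = list(itertools.product('HT', repeat=3))

    def pr(event):
        return sum(1 for w in outcomes if event(w)) / len(outcomes)

    def pr_given(event, given):
        return pr(lambda w: event(w) and given(w)) / pr(given)

    A = lambda w: w[0] == 'H'           # first flip is heads
    B = lambda w: w[1] == 'H'           # second flip is heads
    C = lambda w: w.count('H') >= 1     # at least one head

    # (2) conditional total probability rule: the two printed values agree
    print(pr_given(A, C),
          pr_given(lambda w: A(w) and B(w), C)
          + pr_given(lambda w: A(w) and not B(w), C))

    # (3) conditional Bayes' rule: the two printed values agree
    print(pr_given(A, lambda w: B(w) and C(w)),
          pr_given(B, lambda w: A(w) and C(w)) * pr_given(A, C) / pr_given(B, C))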
We now return to Bayesian inference, using the new probability tools
we have learned.
Inference
Suppose I have a friend who challenges me to a game of flipping
coins, where I win $1 if I correctly guess the outcome but lose $1
if I don't. I know that my friend has been to a website that sells n
different trick coins, each with a different probability p_i of
heads, but I'm not sure which of the coins he purchased. What should
I guess for the first flip? What should I guess for the second flip
if the first one comes up heads? The third flip if the first two are
both heads?
Let X be a random variable that is i if the coin that my friend is
using is the ith coin on the website. Since I have no idea which
coin he is using, the "prior" distribution of X is
Pr[X = i] = 1/n, i ∈ {1,2,...,n}.
Our goal is to refine these probabilities based on our observations.
This is the problem of "inference," where we attempt to determine a
hidden quantity by making observations of events that are related to
the hidden quantity.
Let Y_j be a random variable that is H if the jth flip is heads, T
otherwise. These are the observables, and we have
Pr[Y_j = H | X = i] = p_i.
We also note that the Y_j are conditionally independent given X,
i.e. if we know that coin i is being used, then every flip has
probability p_i of heads, independent of all other flips.
Let's compute the probability that the first flip is heads, i.e.
Pr[Y_1 = H].
By the total probability rule,
Pr[Y_1 = H] = ∑_{i=1}^n Pr[Y_1 = H | X = i] Pr[X = i]
= ∑_{i=1}^n p_i/n
= 1/n ∑_{i=1}^n p_i.
This answers our first question: if
Pr[Y_1 = H] = 1/n ∑_{i=1}^n p_i > 1/2,
I should bet on heads, otherwise tails.
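As a sketch, here is how that first-flip computation might look in
Python; the list of p_i below is a made-up stand-in, since we don't
actually know which coins the website sells.

    # Hypothetical heads-probabilities for the n coins sold on the website.
    p = [0.1, 0.3, 0.5, 0.7, 0.9, 0.95]
    n = len(p)

    # Uniform prior over the coins, so Pr[Y_1 = H] = (1/n) * sum of the p_i.
    pr_h = sum(p) / n
    print(pr_h)                                    # roughly 0.575
    print('bet heads' if pr_h > 0.5 else 'bet tails')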
Now if I see heads on the first flip, that changes the probabilities
of each of the coins. In particular, if one of the coins has tails
on both sides, then I know that my friend is not using that coin.
More generally, we want the "posterior" distribution of X given the
observation that the first flip is heads:
Pr[X = i | Y_1 = H].
By Bayes' rule, we have
Pr[X = i | Y_1 = H] = Pr[Y_1 = H | X = i] Pr[X = i] / Pr[Y_1 = H]
= (p_i 1/n) / (1/n ∑_{k=1}^n p_k)
= p_i / (∑_{k=1}^n p_k).
Now in order to answer our second question, we need to compute the
probability of getting heads on the second flip, given that the
first flip was heads:
Pr[Y_2 = H | Y_1 = H].
We follow the same procedure as in computing Pr[Y_1 = H], except
that we use the conditional version of the total probability rule
given Y_1 = H:
Pr[Y_2 = H | Y_1 = H]
= ∑_{i=1}^n Pr[Y_2 = H | Y_1 = H, X = i] Pr[X = i | Y_1 = H].
Here, we note that as mentioned before, Y_1 and Y_2 are conditionally
independent given X, so
Pr[Y_2 = H | Y_1 = H, X = i]
= Pr[Y_2 = H | X = i]
= p_i.
Plugging this in, we get
Pr[Y_2 = H | Y_1 = H]
= ∑_{i=1}^n (p_i p_i / ∑_{k=1}^n p_k)
= (∑_{i=1}^n p_i^2) / (∑_{i=1}^n p_i).
Now if this quantity is greater than 1/2, I should bet on heads.
We can continue this procedure as we see more observations and
learn more information about which coin is being used. First, we
use the conditional version of Bayes' rule to compute
Pr[X = i | Y_1 = H, Y_2 = H]
= Pr[Y_2 = H | X = i, Y_1 = H] Pr[X = i | Y_1 = H] /
Pr[Y_2 = H | Y_1 = H]
= Pr[Y_2 = H | X = i] Pr[X = i | Y_1 = H] /
Pr[Y_2 = H | Y_1 = H] (conditional independence)
= [p_i p_i / (∑_{k=1}^n p_k)] /
[(∑_{k=1}^n p_k^2) / (∑_{k=1}^n p_k)]
= p_i^2 / (∑_{k=1}^n p_k^2).
We can proceed to compute
Pr[Y_3 = H | Y_1 = H, Y_2 = H]
= (∑_{i=1}^n p_i^3) / (∑_{i=1}^n p_i^2),
which would tell us what to bet on in the third flip.
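Repeating the same argument shows that, under the uniform prior, the
probability of heads on the next flip given m heads in a row is
(∑_{i=1}^n p_i^{m+1}) / (∑_{i=1}^n p_i^m). Here is a short Python
sketch of that computation, reusing the made-up p_i list from before.

    # Hypothetical coin probabilities (the same illustrative list as above).
    p = [0.1, 0.3, 0.5, 0.7, 0.9, 0.95]

    def pr_next_heads(m):
        # Pr[next flip = H | first m flips were all heads], uniform prior:
        # sum of p_i^(m+1) divided by sum of p_i^m.
        return sum(pi ** (m + 1) for pi in p) / sum(pi ** m for pi in p)

    for m in range(4):
        print(m, pr_next_heads(m))
    # The predictive probability of heads increases with m, since long runs
    # of heads shift the posterior toward the most heads-biased coins.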
Iterative Update
As you can see, the above procedure requires a lot of work each time
we make an observation. We can, however, save ourselves much of this
work by computing an update rule that, given any prior distribution,
gives us the posterior distribution after making an observation.
This can be done in the above general case, but for a simpler
example, suppose I've checked my friend's browser history and have
learned that he purchased two trick coins, the first with p_1 = 3/4
and the second with p_2 = 1/2 (not a trick coin, but a trick trick
coin!). Then let's start with an arbitrary prior distribution
Pr[X = 1] = q
Pr[X = 2] = 1 - q.
Now we compute the probability that the first flip is heads:
Pr[Y_1 = H] = Pr[Y_1 = H | X = 1] Pr[X = 1] +
Pr[Y_1 = H | X = 2] Pr[X = 2]
= 3/4 q + 1/2 (1 - q)
= 1/4 q + 1/2.
Then if the first flip is heads, we compute the posterior
distribution:
Pr[X = 1 | Y_1 = H] = Pr[Y_1 = H | X = 1] Pr[X = 1] / Pr[Y_1 = H]
= 3/4 q / (1/4 q + 1/2)
= 3q / (q + 2)
Pr[X = 2 | Y_1 = H] = Pr[Y_1 = H | X = 2] Pr[X = 2] / Pr[Y_1 = H]
= (1/2)(1 - q) / (1/4 q + 1/2)
= (2 - 2q) / (q + 2).
This gives us the update rule for the distribution of X when we see
heads:
(q, 1-q) -> (3q/(q+2), (2-2q)/(q+2)).
We can similarly compute an update rule for when we see tails.
As a concrete example, suppose we start off with q = 1/2, so the
prior distribution is
(1/2, 1/2).
Then if the first flip is heads, the posterior distribution is
(1/2, 1/2) -> ((3/2)/(5/2), 1/(5/2)) = (3/5, 2/5).
Then if the second flip is heads, the new posterior distribution is
(3/5, 2/5) -> ((9/5)/(13/5), (4/5)/(13/5)) = (9/13, 4/13).
We can continue this process as we make more observations.
This procedure is called "recursive Bayesian estimation" and
provides the ability to incrementally update the distribution as
each observation is made.
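Here is a minimal Python sketch of this update for the two-coin example
(p_1 = 3/4, p_2 = 1/2); it reproduces the (1/2, 1/2) -> (3/5, 2/5) ->
(9/13, 4/13) progression above.

    from fractions import Fraction

    # Heads-probabilities of the two coins in the example: p_1 = 3/4, p_2 = 1/2.
    p = [Fraction(3, 4), Fraction(1, 2)]

    def update(prior, flip):
        # One recursive Bayesian step: posterior_i is proportional to
        # Pr[flip | X = i] * prior_i; dividing by the total normalizes.
        likelihood = [pi if flip == 'H' else 1 - pi for pi in p]
        unnorm = [l * q for l, q in zip(likelihood, prior)]
        total = sum(unnorm)
        return [u / total for u in unnorm]

    dist = [Fraction(1, 2), Fraction(1, 2)]   # prior (q, 1 - q) with q = 1/2
    for flip in 'HH':
        dist = update(dist, flip)
        print(flip, dist)         # posteriors: (3/5, 2/5), then (9/13, 4/13)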
Likelihood Ratios
Instead of using probabilities, gamblers use odds ratios. For
example, rather than expressing the probability of getting a four of
a kind in a five card poker hand as 1/4165, they would say that the
odds in favor of getting a four of a kind are 1:4164 or 1/4164,
meaning that it is 1/4164 times as likely to get a four of a kind as
it is to not get one. Similarly, for a roulette wheel, rather than
saying that the probability of black is 9/19, they would say that
the odds in favor of black are 9:10 or 9/10.
Of course, we can get from probabilities to odds and back quite
easily. If the probability of an event is p, then the odds in favor
are
p / (1-p).
Similarly, if the odds of an event are a/b, then
p = a / (a + b).
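Both conversions are one-liners; here is a tiny Python sketch using the
two examples above.

    from fractions import Fraction

    def prob_to_odds(p):
        # If the probability of an event is p, the odds in favor are p / (1 - p).
        return p / (1 - p)

    def odds_to_prob(a, b):
        # If the odds in favor are a : b, the probability is a / (a + b).
        return Fraction(a, a + b)

    print(prob_to_odds(Fraction(1, 4165)))   # 1/4164  (four of a kind)
    print(prob_to_odds(Fraction(9, 19)))     # 9/10    (black on the roulette wheel)
    print(odds_to_prob(9, 10))               # 9/19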
Odds, or "likelihood ratios," can occasionally be easier to work
with. In the coin flipping example above, if we only care about
which of the coins is most likely (to determine whether or not I
should accuse my friend of cheating), then we can save ourselves
some work by using likelihood ratios.
Consider the example where my friend has one of two coins, one with
p_1 = 3/4 and the other p_2 = 1/2. Then the likelihood ratio in the
arbitrary case is
Pr[X = 1] / Pr[X = 2] = q / (1 - q).
Then if we see heads, we want to compute a new likelihood ratio
Pr[X = 1 | Y_1 = H] / Pr[X = 2 | Y_1 = H]
= (Pr[Y_1 = H | X = 1] Pr[X = 1]) /
(Pr[Y_1 = H | X = 2] Pr[X = 2]).
Notice that we no longer have to compute Pr[Y_1 = H], since it cancels
out of the numerator and denominator. So we get
Pr[X = 1 | Y_1 = H] / Pr[X = 2 | Y_1 = H]
= (3/4 q) / (1/2 (1 - q))
= 3/2 q / (1 - q).
Thus, in order to compute a new likelihood ratio, we merely have to
multiply the old one by 3/2.
To repeat the example above, we start with a likelihood ratio of 1.
Then if we see heads, we get 3/2. Then if we see another, we get
9/4. And if we see another, we get 27/8. And if we see one more, we
get 81/16, and it's a good bet that my friend is cheating.
What if we see tails? We get
Pr[X = 1 | Y_1 = T] / Pr[X = 2 | Y_1 = T]
= (Pr[Y_1 = T | X = 1] Pr[X = 1]) /
(Pr[Y_1 = T | X = 2] Pr[X = 2])
= (1/4 q) / (1/2 (1 - q))
= (1/2) q / (1 - q).
So we update our likelihood ratio by multiplying by 1/2.
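Putting this together, tracking only the likelihood ratio is a one-line
update per flip; here is a short Python sketch for the two-coin example.

    from fractions import Fraction

    # Per-flip multipliers for the ratio Pr[X = 1] / Pr[X = 2] in the two-coin
    # example (p_1 = 3/4, p_2 = 1/2): 3/2 on heads, 1/2 on tails.
    MULTIPLIER = {'H': Fraction(3, 2), 'T': Fraction(1, 2)}

    ratio = Fraction(1)                  # uniform prior, so q / (1 - q) = 1
    for flip in 'HHHH':
        ratio *= MULTIPLIER[flip]
        print(flip, ratio)               # 3/2, 9/4, 27/8, 81/16
    # A ratio far above 1 means coin 1 (the 3/4-heads coin) is far more
    # likely, which is when I might start accusing my friend of cheating.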