Administrative info
  HW7 out, due Monday
  MT2 next Tuesday
    Same location and policies as MT1
    Covers material through polling/LLN (Wednesday)
  Review session
    Monday 6:30-8 in 320 Soda
  Exams will be available for pickup from the Soda front office

Review
  Recall that we defined independence for random variables as follows.
  Let X and Y be random variables, A be the set of values that
  X can take on, B be the set of values Y can take on. Then X
  and Y are independent if
    ∀a∈A ∀b∈B . Pr[X=a ∩ Y=b] = Pr[X=a] Pr[Y=b].
  Equivalently, X and Y are independent if
    ∀a∈A ∀b∈B . Pr[X=a|Y=b] = Pr[X=a]
  whenever Pr[Y=b] > 0.

  We can similarly define mutual independence for more than two random
  variables.

Independent Random Variables
  We have used the fact that Var(X+Y) = Var(X) + Var(Y) for
  independent random variables X and Y. We now prove that fact.

  For independent random variables X and Y, we have
    E(XY) = E(X)E(Y).
  Proof:
    E(XY) = ∑_{a,b} ab Pr[X=a ∩ Y=b]
          = ∑_{a,b} ab Pr[X=a] Pr[Y=b]            (independence)
          = ∑_{a} ∑_{b} ab Pr[X=a] Pr[Y=b]
          = (∑_{a} a Pr[X=a]) * (∑_{b} b Pr[Y=b])
          = E(X) * E(Y).
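
  As a sanity check, here is a small Python computation of both sides
  on a made-up pair of distributions; the joint is formed as the
  product of the marginals, which is exactly the independence
  assumption.

    # Numeric check of E(XY) = E(X)E(Y) for independent X and Y.
    # The two distributions below are made up purely for illustration.

    X = {1: 0.2, 2: 0.5, 3: 0.3}   # Pr[X = a]
    Y = {0: 0.6, 4: 0.4}           # Pr[Y = b]

    # Under independence, Pr[X = a, Y = b] = Pr[X = a] Pr[Y = b].
    E_XY = sum(a * b * pa * pb
               for a, pa in X.items() for b, pb in Y.items())
    E_X = sum(a * pa for a, pa in X.items())
    E_Y = sum(b * pb for b, pb in Y.items())

    print(E_XY, E_X * E_Y)   # both 3.36 (up to float rounding)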

  We have already claimed that for independent random variables X and
  Y, Var(X+Y) = Var(X) + Var(Y). Now we prove it.
    Var(X+Y) = E((X+Y)^2) - E(X+Y)^2
      = E(X^2 + 2XY + Y^2) - (E(X) + E(Y))^2
      = E(X^2) + 2E(XY) + E(Y^2) - E(X)^2 - 2E(X)E(Y) - E(Y)^2
                                      (linearity of expectation)
      = E(X^2) + 2E(X)E(Y) + E(Y^2) - E(X)^2 - 2E(X)E(Y) - E(Y)^2
                                      (E(XY) = E(X)E(Y) by independence)
      = E(X^2) + E(Y^2) - E(X)^2 - E(Y)^2
      = Var(X) + Var(Y).

  What if X and Y are not independent? Let's consider the extreme
  case Y = X. Then
    E(XY) = E(XX) = E(X^2) ≠ E(X)^2 in general
    Var(X+Y) = Var(2X) = 4Var(X) ≠ 2Var(X) (unless Var(X) = 0).
  So the two facts above do not hold if X and Y are not independent.
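
  Here is a short Python check of both cases, with X a fair coin flip
  (an illustrative choice, not anything from the problem sets).

    # Var(X+Y) = Var(X) + Var(Y) holds for an independent pair and
    # fails in the dependent extreme Y = X.

    def var(dist):
        """Variance of a distribution given as {value: probability}."""
        mean = sum(v * p for v, p in dist.items())
        return sum(p * (v - mean) ** 2 for v, p in dist.items())

    X = {0: 0.5, 1: 0.5}             # a fair coin flip

    # Independent case: X + Y for Y an independent copy of X.
    indep_sum = {0: 0.25, 1: 0.50, 2: 0.25}
    print(var(indep_sum), 2 * var(X))   # both 0.5

    # Dependent extreme Y = X: X + Y = 2X, so the variance quadruples.
    dep_sum = {0: 0.5, 2: 0.5}          # distribution of 2X
    print(var(dep_sum), 4 * var(X))     # both 1.0, not 2 Var(X) = 0.5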

Joint Distribution
  The fact that two random variables X and Y are independent gives us
  all the information that we need in order to perform calculations
  involving those two variables. We now turn our attention to the
  general case when X and Y are not independent.

  Recall that for non-independent events A and B, knowing Pr[A] and
  Pr[B] alone is not enough to compute probabilities of the form
    Pr[A ∩ B]
    Pr[A ∪ B]
    Pr[A|B];
  we also needed to know Pr[A ∩ B]. This quantity encodes all the
  information about how A and B are correlated.

  For random variables, we need something similar. We need the values
    Pr[X = a ∩ Y = b]
  for all values a in the set of values that X can take on and all
  values b in the set of values that Y can take on. This set of
  probabilities is called the "joint distribution" of X and Y.

  Since we will now be making quite heavy use of intersections, and
  writing ∩ everywhere is tedious, we use a comma instead to
  denote intersection:
    Pr[X = a, Y = b].

  Before we continue with examples of joint distributions, we first
  generalize our definition of random variables. Previously, we
  defined a random variable X on a probability space Ω as a
  function from Ω to the real numbers
    X : Ω -> R,
  so X(ω) is a real number for all ω ∈ Ω. We
  can generalize the range of a random variable to be any set S:
    X : Ω -> S.
  For example, in PA3, we went to the trouble of defining a numerical
  value for winning, drawing, and losing, and similarly for each type
  of hand, in order to define the random variables W and O. Instead,
  we could have defined the range of W to be the set
    S = {Lose, Draw, Win}.
  Then W would assign an element of S to each outcome ω.
    ∀ ω ∈ Ω . W(ω) ∈ S.
  If the range of a random variable is not a subset of R, however, we
  cannot talk about its expectation or variance. (In the case of PA3,
  the members of S are ordered, so it made sense to use numerical
  values so that we can compute expectation and variance, which give
  us information about how much money we expect to win or lose.
  Similarly, the types of hands are ordered as well.)
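
  As a quick illustration, here is what such a generalized random
  variable looks like in Python; the outcomes and probabilities are
  hypothetical.

    # A random variable with the non-numeric range S = {Lose, Draw, Win},
    # in the spirit of the PA3 example. Outcomes and probabilities are
    # made up for illustration.

    S = {"Lose", "Draw", "Win"}

    W = {"omega1": "Win", "omega2": "Draw", "omega3": "Lose"}  # W(omega)
    pr = {"omega1": 0.5, "omega2": 0.2, "omega3": 0.3}         # Pr[omega]

    # Events like W = s still make sense, even though E(W) does not.
    for s in S:
        print(s, sum(p for w, p in pr.items() if W[w] == s))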

  We now turn to examples of joint distributions using generalized
  random variables.

  Suppose we are trying to diagnose a rare ailment (say
  neurocysticercosis or something else you would see on House,
  M.D.) according to the severity of a particular symptom. Let X be a
  random variable that is 1 if the patient has the disease and 0
  otherwise. Let Y be a random variable that takes on one of the
  values {none, moderate, severe}. Then the joint distribution of X
  and Y might be the following:
    Pr[X=a,Y=b]    Y   none   moderate   severe
                 X
                 0     0.72     0.18      0.00
                 1     0.02     0.05      0.03
  In other words, Pr[X = 0, Y = none] = 0.72, so 72% of patients have
  neither the disease nor any symptoms. The table gives us all values
  of Pr[X = a, Y = b], so it completely specifies the joint
  distribution of X and Y.

  The joint distribution gives us all of the information about X and
  Y. Since the events Y = b for the distinct values b partition the
  sample space (and likewise for the events X = a), we can use the
  total probability rule to obtain "marginal" distributions for X
  and Y:
    Pr[X = 0] = Pr[X = 0, Y = none] +
                Pr[X = 0, Y = moderate] +
                Pr[X = 0, Y = severe]
      = 0.90.
  We can do this for all values of X and Y by adding the values in
  the appropriate row or column of the table:
    Pr[X=a,Y=b]    Y    none   moderate   severe  |  Pr[X = a]
                 X                                |
                 0      0.72     0.18      0.00   |    0.90
                 1      0.02     0.05      0.03   |    0.10
             -------------------------------------'
             Pr[Y = b]  0.74     0.23      0.03
  This implies that 10% of all patients have the disease, and 3% of
  all patients have severe symptoms.

  For independent random variables Q and R, we have
    Pr[Q = q, R = r] = Pr[Q = q] Pr[R = r],
  so the joint distribution is the product of the marginals when the
  random variables are independent.
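
  Here is a short Python sketch that computes the marginals from the
  table above and then checks whether the joint factors into the
  product of the marginals (it does not, so X and Y are dependent).

    # Marginals of the disease/symptom joint distribution, obtained by
    # summing rows and columns of the table.

    joint = {
        (0, "none"): 0.72, (0, "moderate"): 0.18, (0, "severe"): 0.00,
        (1, "none"): 0.02, (1, "moderate"): 0.05, (1, "severe"): 0.03,
    }

    pr_X, pr_Y = {}, {}
    for (a, b), p in joint.items():
        pr_X[a] = pr_X.get(a, 0) + p   # total probability rule over b
        pr_Y[b] = pr_Y.get(b, 0) + p   # total probability rule over a

    print(pr_X)   # roughly {0: 0.90, 1: 0.10}
    print(pr_Y)   # roughly {'none': 0.74, 'moderate': 0.23, 'severe': 0.03}

    # Independence would require Pr[X=a, Y=b] = Pr[X=a] Pr[Y=b]:
    print(joint[(0, "none")], pr_X[0] * pr_Y["none"])   # 0.72 vs about 0.67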

  Recall that Y = b is just an event, so we can compute conditional
  probabilities given the event Y = b:
    Pr[X = 0 | Y = b] = Pr[X = 0, Y = b] / Pr[Y = b]
    Pr[X = 1 | Y = b] = Pr[X = 1, Y = b] / Pr[Y = b]
  This set of probabilities is called the "conditional distribution"
  of X given Y = b. We can write the conditional distributions Pr[X =
  a | Y = b] in table form as well:
    Pr[X=a|Y=b]    Y   none   moderate   severe
                 X
                 0     0.97     0.78      0.00
                 1     0.03     0.22      1.00
  We can similarly compute the conditional distributions Pr[Y = b | X
  = a]:
    Pr[Y=b|X=a]    Y   none   moderate   severe
                 X
                 0     0.80     0.20      0.00
                 1     0.20     0.50      0.30

  Like unconditional distributions, conditional distributions must sum
  to 1. Recall how we defined conditional probability: Pr[A|B] is the
  probability of A in a new sample space Ω' = B. So conditioning
  merely defines a new sample space, and all the rules of probability
  must hold in this new sample space.
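
  A short Python check that recovers the conditional distributions
  Pr[X = a | Y = b] from the joint table and verifies that each one
  sums to 1.

    joint = {
        (0, "none"): 0.72, (0, "moderate"): 0.18, (0, "severe"): 0.00,
        (1, "none"): 0.02, (1, "moderate"): 0.05, (1, "severe"): 0.03,
    }
    symptoms = ("none", "moderate", "severe")
    pr_Y = {b: joint[(0, b)] + joint[(1, b)] for b in symptoms}

    # Pr[X = a | Y = b] = Pr[X = a, Y = b] / Pr[Y = b]
    cond = {(a, b): joint[(a, b)] / pr_Y[b] for (a, b) in joint}
    print(round(cond[(1, "moderate")], 2))   # 0.22, as in the table

    for b in symptoms:
        print(b, cond[(0, b)] + cond[(1, b)])   # 1.0 for every b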

  Let's do another example. Suppose Victor and I play
  rock-paper-scissors in order to determine who gets to grade problem
  1 on midterm 2. Let X be my choice of weapon, Y Victor's choice.
  Then the joint distribution of X and Y might be:
    Pr[X=a,Y=b]      Y   rock   paper   scissors  |  Pr[X = a]
                 X                                |
               rock      0.12   0.12      0.16    |     0.4
               paper     0.09   0.09      0.12    |     0.3
              scissors   0.09   0.09      0.12    |     0.3
             -------------------------------------'
             Pr[Y = b]   0.3    0.3       0.4
  As you can see, I am slightly biased towards rock, and Victor is
  slightly biased towards scissors. What's my probability of beating
  Victor and getting an easy problem to grade?

  Let W be 1 if I win, 0 if we draw, and -1 if I lose. Then we can
  compute the joint distribution of X and W. We note that if I choose
  rock, I win when he chooses scissors, I lose when he chooses paper,
  and we draw when he chooses rock. So
    Pr[X = rock, W = 1] = Pr[X = rock, Y = scissors]
    Pr[X = rock, W = -1] = Pr[X = rock, Y = paper]
    Pr[X = rock, W = 0] = Pr[X = rock, Y = rock].
  Repeating this for all values of X, we get
    Pr[X=a,W=c]      W    +1     0         -1     |  Pr[X = a]
                 X                                |
               rock      0.16   0.12      0.12    |     0.4
               paper     0.09   0.09      0.12    |     0.3
              scissors   0.09   0.12      0.09    |     0.3
             -------------------------------------'
             Pr[W = c]   0.34   0.33      0.33
  So I have a slightly higher probability of winning than losing,
  which is good news for me. (I get to go home early from grading!)
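
  The same bookkeeping can be done in Python: map each pair of moves
  to a value of W and accumulate probabilities (the helper names BEATS
  and outcome are ours, purely for illustration).

    BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

    def outcome(mine, his):
        """+1 if my move beats his, -1 if his beats mine, 0 on a draw."""
        if mine == his:
            return 0
        return 1 if BEATS[mine] == his else -1

    joint_XY = {
        ("rock", "rock"): 0.12, ("rock", "paper"): 0.12,
        ("rock", "scissors"): 0.16,
        ("paper", "rock"): 0.09, ("paper", "paper"): 0.09,
        ("paper", "scissors"): 0.12,
        ("scissors", "rock"): 0.09, ("scissors", "paper"): 0.09,
        ("scissors", "scissors"): 0.12,
    }

    joint_XW = {}
    for (mine, his), p in joint_XY.items():
        key = (mine, outcome(mine, his))
        joint_XW[key] = joint_XW.get(key, 0) + p

    print(joint_XW[("rock", 1)])                              # 0.16
    print(sum(p for (a, c), p in joint_XW.items() if c == 1)) # Pr[W=1] = 0.34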

  We can also compute conditional distributions Pr[W=c|X=a], which
  will tell me that I should choose rock (until he catches on, of
  course).
    Pr[W=c|X=a]      W    +1     0         -1
                 X
               rock      0.4    0.3       0.3
               paper     0.3    0.3       0.4
              scissors   0.3    0.4       0.3
  As expected, since Victor is biased towards scissors, I am more
  likely to win if I choose rock and less likely if I choose paper.

Conditional Probability Spaces
  Suppose that instead, we bet $1 on the outcome of each game of
  rock-paper-scissors. Then W is exactly the amount of money I win,
  and I would like to know how much I can expect to win if I play
  Victor many times. We can compute this from the marginal
  distribution of W:
    E(W) = 1 * 0.34 + 0 * 0.33 - 1 * 0.33
      = 0.01.
  I would also like to know how much I can expect to win for each of
  my choices. That way, I have a better idea of what I should choose
  (again, until Victor catches on). So I want to know the "conditional
  expectation" of W given X = a for each a.

  Again, conditioning on the event X = a gives us a new sample space,
  and anything we can do in an arbitrary sample space we can do in
  this new one. We just have to replace all our probabilities with
  conditional probabilities given by the new sample space.

  Doing so, we define conditional expectation as follows:
    E(W | F) = ∑_{c ∈ C} c * Pr[W = c | F],
  where F is any event and C is the set of all possible values
  that W can take on. So in this case, we have
    E(W | X = rock) = 1 * 0.4 + 0 * 0.3 - 1 * 0.3 = 0.1
    E(W | X = paper) = 1 * 0.3 + 0 * 0.3 - 1 * 0.4 = -0.1
    E(W | X = scissors) = 1 * 0.3 + 0 * 0.4 - 1 * 0.3 = 0
  So I expect to win more if I choose rock.
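
  In Python, each conditional expectation is a one-line sum over the
  conditional distribution table above.

    cond_W = {   # cond_W[a][c] = Pr[W = c | X = a]
        "rock":     {1: 0.4, 0: 0.3, -1: 0.3},
        "paper":    {1: 0.3, 0: 0.3, -1: 0.4},
        "scissors": {1: 0.3, 0: 0.4, -1: 0.3},
    }

    for a, dist in cond_W.items():
        # E(W | X = a) = sum over c of c * Pr[W = c | X = a]
        print(a, sum(c * p for c, p in dist.items()))
        # rock 0.1, paper -0.1, scissors 0.0 (up to float rounding)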

  We can obtain the unconditional expectation E(W) from the
  conditional expectations as follows, where A is the set of
  values that X can take on:
    E(W) = ∑_{c∈C} c Pr[W = c]
      = ∑_{c∈C} c (∑_{a∈A} Pr[X=a]Pr[W=c|X=a])
                                             (total probability rule)
      = ∑_{a∈A} Pr[X=a] (∑_{c∈C} c Pr[W=c|X=a])
      = ∑_{a∈A} Pr[X=a] E(W|X=a).
  This is called the "total expectation law." In this case, we get
    E(W) = E(W | X = rock) Pr[X = rock] +
           E(W | X = paper) Pr[X = paper] +
           E(W | X = scissors) Pr[X = scissors]
      = 0.1 * 0.4 - 0.1 * 0.3 + 0 * 0.3
      = 0.01,
  which is the same as our previous answer.
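
  The same check in Python: weighting the conditional expectations by
  Pr[X = a] recovers E(W).

    pr_X = {"rock": 0.4, "paper": 0.3, "scissors": 0.3}
    cond_E = {"rock": 0.1, "paper": -0.1, "scissors": 0.0}  # from above

    E_W = sum(pr_X[a] * cond_E[a] for a in pr_X)
    print(E_W)   # 0.01 (up to float rounding), matching the marginal answer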

  In the next discussion section, you will see a neat way of computing
  the expectation of a geometric random variable using conditional
  expectation and the total expectation law.

  We can also define "conditional independence" for events A and B.
  A and B are independent conditional on C if
    Pr[A, B | C] = Pr[A|C] Pr[B|C].
  Equivalently, A and B are independent given C if
    Pr[A | B, C] = Pr[A|C].
  This tells us that if we are given C, then knowing B occurred gives
  us no information about whether or not A occurred.

  Note that Pr[A|B|C] is meaningless, since what is to the right of
  the bar determines our sample space. So in order to condition on two
  events B and C, we condition on their intersection.

  We can similarly define conditional independence for random
  variables.

  There are conditional versions of other probability rules as well.
  (1) inclusion/exclusion
      Pr[A∪B|C] = Pr[A|C] + Pr[B|C] - Pr[A,B|C]
  (2) total probability rule
      Pr[A|C] = Pr[A,B|C] + Pr[A,B̄|C]
        = Pr[A|B,C] Pr[B|C] + Pr[A|B̄,C] Pr[B̄|C]
  (3) Bayes' rule
      Pr[A|B,C] = Pr[B|A,C] Pr[A|C] / Pr[B|C]
  As an exercise, you may wish to prove some of these on your own.
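
  To see the conditional Bayes' rule in action, here is a numeric
  check on a small made-up sample space; the outcomes, probabilities,
  and events are chosen arbitrarily.

    pr = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}   # Pr[omega], sums to 1
    A, B, C = {0, 1}, {1, 2}, {1, 2, 3}     # three illustrative events

    def P(event):
        return sum(pr[w] for w in event)

    def P_given(event, given):
        return P(event & given) / P(given)

    lhs = P_given(A, B & C)                                # Pr[A | B, C]
    rhs = P_given(B, A & C) * P_given(A, C) / P_given(B, C)
    print(lhs, rhs)   # both 0.4 (up to float rounding)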