Administrative info
HW7 out, due Monday
MT2 next Tuesday
Same location and policies as MT1
Covers through polling/LLN (Wednesday)
Review
Recall that the variance of a random variable Z with expectation
E(Z) = μ is
Var(Z) = E((Z - μ)^2).
It tells us something about the deviation of Z from its mean.
The "standard deviation" of Z is
σ(Z) = √{Var(Z)},
which in some sense undoes the square in the variance.
An alternative expression for variance is
Var(Z) = E(Z^2) - μ^2.
For a binomial random variable X ~ Bin(n, p), Var(X) = np(1-p).
For a geometric random variable Y ~ Geom(p), Var(Y) = (1-p)/p^2.
For a Poisson random variable Z ~ Poiss(λ), Var(Z) = λ.
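As a quick sanity check, here is a minimal Python sketch (assuming numpy is available; the parameter values and the sample size of 10^6 are arbitrary choices for illustration) that compares empirical variances of simulated samples against these formulas.
    import numpy as np

    rng = np.random.default_rng(0)
    N = 10**6                      # samples per distribution (arbitrary choice)
    n, p, lam = 20, 0.3, 4.0       # illustrative parameters

    # Empirical variance vs. the formulas np(1-p), (1-p)/p^2, and lambda.
    print(rng.binomial(n, p, N).var(), n * p * (1 - p))
    print(rng.geometric(p, N).var(), (1 - p) / p**2)
    print(rng.poisson(lam, N).var(), lam)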
Here are some useful facts about variance.
(1) Var(cX) = c^2 Var(X), where c is a constant.
(2) Var(X+c) = Var(X), where c is a constant.
(3) Var(X+Y) = Var(X) + Var(Y) if X and Y are independent.
We defined independence for random variables as follows. Let X and Y
be random variables, A be the set of values that X can take on, B be
the set of values Y can take on. Then X and Y are independent if
∀a∈A ∀b∈B . Pr[X=a ∩ Y=b] = Pr[X=a] Pr[Y=b].
Equivalently, X and Y are independent if
∀a∈A ∀b∈B . Pr[X=a|Y=b] = Pr[X=a].
We can similarly define mutual independence for more than two random
variables.
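To make the definition concrete, here is a small Python sketch (numpy assumed; the coin-flip variables are an arbitrary example) that estimates both sides of the defining equation for two independent coin flips, and then for a pair that is clearly not independent.
    import numpy as np

    rng = np.random.default_rng(1)
    N = 10**6
    X = rng.integers(0, 2, N)      # fair coin flip
    Y = rng.integers(0, 2, N)      # a second, independent fair coin flip

    # Estimate Pr[X=a, Y=b] and Pr[X=a] Pr[Y=b] for every pair (a, b).
    for a in (0, 1):
        for b in (0, 1):
            joint = np.mean((X == a) & (Y == b))
            product = np.mean(X == a) * np.mean(Y == b)
            print(a, b, round(joint, 3), round(product, 3))

    # Contrast: X is not independent of itself, e.g. Pr[X=0 ∩ X=1] = 0 != 1/4.
    print(np.mean((X == 0) & (X == 1)), np.mean(X == 0) * np.mean(X == 1))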
Chebyshev's inequality
So far, we have been working with the squares of deviations from the
mean. However, we are more interested in the absolute value of such
deviations. Now we will see how the two can be related.
Let X be an arbitrary random variable, μ = E(X), and let α > 0 be
any deviation of interest. Then
Pr[|X - μ| >= α] <= Var(X)/α^2.
This is called "Chebyshev's inequality." It tells us that the
probability that X deviates from its mean by at least α is at
most Var(X)/α^2.
Another expression of Chebyshev's inequality, with α =
βσ, where σ = √{Var(X)} is the standard
deviation of X, is
Pr[|X - μ| >= βσ] <= 1/β^2.
This tells us, for example, that the probability of X deviating from
the mean by more than two standard deviations is no more than 1/4.
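As a rough illustration of this form of the bound, the following Python sketch (numpy assumed; the Bin(100, 1/2) example is an arbitrary choice) compares the empirical probability of deviating by β standard deviations with the bound 1/β^2.
    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 100, 0.5
    mu, sigma = n * p, np.sqrt(n * p * (1 - p))

    X = rng.binomial(n, p, 10**6)
    for beta in (1, 2, 3):
        empirical = np.mean(np.abs(X - mu) >= beta * sigma)
        print(beta, empirical, 1 / beta**2)   # empirical probability vs. Chebyshev bound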
The proof of Chebyshev's inequality relies on "Markov's inequality:"
Pr[X >= α] <= E(X)/α
for a random variable X that only takes on non-negative values.
Generally speaking, Chebyshev's inequality gives us a tighter bound
than Markov's, since it uses more information about X, namely its
variance. Of course, if X takes on negative values, then we can't
use Markov's inequality anyway.
The proofs of both inequalities are in the reader.
EX: In the random walk, we had E(Y) = 0, Var(Y) = n. Then if we
take n = 1,000,000 steps, the probability we end up at least
10,000 steps away is at most
Pr[|Y - 0| >= 10^4] <= 10^6/(10^4)^2
= 1/100.
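A quick simulation suggests how loose this bound can be. The Python sketch below (numpy assumed) uses the fact that Y = 2*Bin(n, 1/2) - n for a walk of n fair ±1 steps, so we do not have to store every step.
    import numpy as np

    rng = np.random.default_rng(3)
    n, trials = 10**6, 10**4

    Y = 2 * rng.binomial(n, 0.5, trials) - n     # final positions of 10^4 walks
    print(np.mean(np.abs(Y) >= 10**4))           # empirical probability
    print(n / (10**4)**2)                        # Chebyshev bound = 1/100
Since the standard deviation of Y is √n = 1000, a deviation of 10^4 is ten
standard deviations, and the empirical probability comes out as essentially
zero; Chebyshev's bound of 1/100 is valid but far from tight.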
EX: If I have a probability p = 1/100 of passing my driving exam,
then if T is the number of attempts it takes, T ~ Geom(1/100),
E(T) = 100, Var(T) = (99/100)/(1/100)^2 = 9900. Then the
probability it takes me at least 900 attempts is at most
Pr[T >= 900] = Pr[T - 100 >= 800]
<= Pr[|T - 100| >= 800]
<= 9900/800^2
≈ 0.015.
In the second line, we used the fact that
Pr[|T-100| >= 800] = Pr[T-100 >= 800] + Pr[T-100 <= -800],
so
Pr[|T-100| >= 800] >= Pr[T-100 >= 800].
This is a useful trick. (In this case, the two are equal, since
T only takes on positive values.)
Since T only takes on non-negative values, we can use Markov's
bound. Let's see what we get.
Pr[T >= 900] <= 100/900
= 1/9 ≈ 0.11.
This is a much weaker bound.
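For comparison, here is a short Python sketch that computes both bounds together with the exact probability, using the fact that T >= 900 exactly when the first 899 attempts all fail.
    p = 1 / 100
    mu, var = 1 / p, (1 - p) / p**2     # 100 and 9900

    exact = (1 - p)**899                # Pr[T >= 900] = Pr[first 899 attempts fail]
    chebyshev = var / 800**2            # about 0.015
    markov = mu / 900                   # about 0.11
    print(exact, chebyshev, markov)
The exact probability is about 0.00012, so both bounds hold, but neither is
particularly tight here.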
Polling
Suppose we want to estimate the proportion p of Democrats in
California. How many residents do we need to ask in order to be 95%
sure that our estimate of p is within 0.05 of the actual value?
This is an example of polling, where we want to estimate some
fraction p to within a given "error" ε, at a given "confidence
level" 1-δ, where δ is the allowed uncertainty in our estimate.
Here, ε = 0.05, and δ = 1 - 0.95 = 0.05.
(In real-life polls, you will see these two values given along with
the result of the poll.)
Let's formalize this problem. Suppose we ask n random Californians
whether or not they are Democrats. Let S_n be the number of people
who say yes. Then we estimate the fraction p as A_n = S_n/n, i.e.
the fraction of people we sampled who are Democrats.
As usual, we define indicator random variables
X_i = {1 if the ith person is a Democrat
{0 otherwise.
Then S_n = X_1 + X_2 + ... + X_n. Since each person is randomly
chosen, Pr[X_i = 1] = p, the proportion of Democrats in the entire
population. (Of course, we don't know p. But it has some value, so
our analysis will hold, as long as we don't make any assumptions
about its value.)
As always for an indicator random variable, E(X_i) = Pr[X_i = 1], so
E(X_i) = p. Then by linearity of expectation, E(S_n) = np.
What about Var(S_n)? For each X_i, we have
Var(X_i) = E(X_i^2) - E(X_i)^2
= p - p^2
= p(1-p).
Then we notice that the X_i are mutually independent, since each
person polled is chosen independently and uniformly at random.
Therefore, using fact (3) above, we get
Var(S_n) = Var(X_1) + ... + Var(X_n)
= n Var(X_i)
= np(1-p).
Of course, we could have noticed that S_n ~ Bin(n, p) and arrived at
E(S_n) and Var(S_n) immediately. But the process above works in the
general case, in which X_i is not an indicator random variable, as
long as they are "independent and identically distributed,"
abbreviated as "i.i.d.". For example, we may want to estimate the
average wealth of Californians, in which case X_i has a more
complicated distribution. We would still have E(X_i) equal to the
quantity we are trying to estimate (here, the average wealth), but we
would now need to place an upper bound on Var(X_i) rather than using
the exact value.
Finally, let's consider our estimate A_n = S_n/n. By linearity of
expectation, E(A_n) = 1/n E(S_n) = p. Similarly, by fact (1) above,
we get Var(A_n) = 1/n^2 Var(S_n) = p(1-p)/n.
This is good news, since we expect the estimate A_n to be p, and the
variance of A_n is inversely proportional to the sample size n, so it
shrinks as we poll more people.
But how big does n have to be to achieve the required error and
confidence?
We want no more than a δ probability that the estimate A_n will
deviate from p by more than ε, or
Pr[|A_n - p| >= ε] <= δ.
We can use Chebyshev's inequality to bound the left hand side.
Plugging in ε = 0.05, we get
Pr[|A_n - p| >= 0.05] <= Var(A_n)/0.05^2
= p(1-p)/(0.0025n).
Unfortunately, we don't know p, since that is what we are trying to
estimate. But we can place an upper bound on p(1-p): for any p
between 0 and 1,
p(1-p) <= 1/4,
with the maximum attained at p = 1/2. So
Pr[|A_n - p| >= 0.05] <= p(1-p)/(0.0025n)
<= 0.25/(0.0025n)
= 100/n.
Now, we wanted
Pr[|A_n - p| >= 0.05] <= 0.05,
so it is enough to require
100/n <= 0.05,
i.e. n >= 2000.
Thus, if we poll at least 2000 random Californians, we can be 95%
sure that our estimate of p is within 0.05 of the actual value.
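Here is a minimal Python simulation of this conclusion (numpy assumed; the true value p = 0.45 is a hypothetical choice purely for illustration, since the analysis does not depend on it).
    import numpy as np

    rng = np.random.default_rng(4)
    p_true = 0.45                  # hypothetical true proportion, for illustration only
    n, trials = 2000, 10**4

    # Each trial polls n people and records the estimate A_n = S_n / n.
    A = rng.binomial(n, p_true, trials) / n
    print(np.mean(np.abs(A - p_true) >= 0.05))   # fraction of polls that miss by >= 0.05
The observed failure rate is typically far below 0.05, which is expected:
both Chebyshev's inequality and the bound p(1-p) <= 1/4 are conservative.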
We can repeat the above procedure in the general case, and we will
find that it is enough for the sample size n to satisfy
n >= σ^2/(ε^2 δ),
where σ^2 = Var(X_i). In practice, we would use an upper bound
on σ^2, since we don't know what its exact value is.
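This formula is easy to package as a helper; the sketch below is a hypothetical Python function (not from the reader) that recovers the polling answer when we plug in the bound σ^2 <= 1/4.
    import math

    def sample_size(var_bound, eps, delta):
        # Smallest n with var_bound / (n * eps^2) <= delta,
        # i.e. n >= var_bound / (eps^2 * delta).
        return math.ceil(var_bound / (eps**2 * delta))

    print(sample_size(0.25, 0.05, 0.05))   # the polling example above: 2000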
The Law of Large Numbers
As we noted above, Var(A_n) decreases as the sample size increases.
This implies that A_n converges to its expected value as n
increases. A_n is the average of the random variables X_i. In
general, if we make many observations of i.i.d. random variables,
then their average converges to the expected value. We formalize
this as the "law of large numbers."
Let X_1, ..., X_n be i.i.d. random variables with expectation μ =
E(X_i). Let A_n = 1/n (X_1 + ... + X_n). Then for any α > 0,
we have
Pr[|A_n - μ| >= α] -> 0 as n -> ∞.
Proof:
Let σ^2 = Var(X_i). Then E(A_n) = μ and Var(A_n) =
σ^2/n. By Chebyshev's inequality, we have
Pr[|A_n - μ| >= α] <= Var(A_n)/α^2
= σ^2/(nα^2)
which goes to 0 as n -> ∞.
Thus, as the number of samples increases, the law of large numbers
tells us that the deviation of their average A_n from the mean μ
tends to zero.
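The following Python sketch (numpy assumed; Bernoulli(1/2) samples and α = 0.1 are arbitrary choices) estimates Pr[|A_n - μ| >= α] for increasing n and shows it heading toward zero.
    import numpy as np

    rng = np.random.default_rng(5)
    alpha, trials, p = 0.1, 10**4, 0.5     # X_i ~ Bernoulli(1/2), so mu = 0.5

    for n in (10, 100, 1000, 10000):
        A = rng.binomial(n, p, trials) / n          # 10^4 independent copies of A_n
        print(n, np.mean(np.abs(A - p) >= alpha))   # deviation probability shrinks with n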
Note that the law of large numbers does not tell us that the
deviation of their sum S_n = X_1 + ... + X_n from nμ tends to
zero! For example, if we recall our coin flipping game, then X_i =
+/- 1, and S_n is the amount we win after playing n rounds. While
the average amount A_n we win in each round tends to zero as
we play more rounds, the total amount S_n does not. In fact,
the Chebyshev bound on S_n diverges as n becomes large, since
Var(S_n) = nσ^2.
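A short simulation makes the contrast visible. The Python sketch below (numpy assumed) simulates the ±1 coin flipping game and prints the typical size of S_n alongside the typical size of A_n = S_n/n.
    import numpy as np

    rng = np.random.default_rng(6)
    trials = 10**4

    for n in (100, 10000, 1000000):
        S = 2 * rng.binomial(n, 0.5, trials) - n          # winnings after n rounds
        print(n, np.mean(np.abs(S)), np.mean(np.abs(S)) / n)
The typical size of S_n keeps growing (roughly like √n), while the typical
size of A_n shrinks toward zero, exactly as the law of large numbers
describes.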
The assumption that S_n converges to its expectation is known as the
"gambler's fallacy." It manifests itself in gamblers thinking that
if they see a run of heads when flipping a fair coin, for example,
correspondingly more tails will show up to even things out.
You will see the gambler's fallacy in many other areas as well. For
example, if a baseball player is having a cold streak at the plate,
it is often said that he is "due" for a hit. This is false, of
course, if we take at-bats to be independent. In fact, if we take
psychology into account, then it is probably more likely that the
cold streak will continue.
The lesson here is that the law of large numbers applies to the
average of many observations, but not to the sum.