
MATH 3080 - Spring 2024

Conditional Probability
and Independence


Conditional probability

Since probability is a model for uncertainty and a lack of information, one expects probabilities to change as new information comes to light. For instance, when playing poker you would like to know how the probability of certain hands changes as various cards are revealed, and when modeling the weather you would like to keep your models and predicted probabilities up to date based on the current information.

Conditional probability deals with how the probability of an event changes with certain information. In general, given two events $A, B\subset S$, we would like to make sense of the conditional probability of $A$ given that $B$ has occurred, which we denote by \[ P(A|B). \]

Note

Typically we read the above probability $P(A|B)$ as "the probability of $A$ given $B$" or "the probability of $A$ conditioned on $B$."

Let's illustrate this first with an example:

Example 1 (3 coin tosses)

Suppose you flip a fair coin three times. What is the probability of three heads?


Well as we have seen already, we know the sample space is given by

\[ S = \{HHH, HHT, HTH, HTT, THH , THT, TTH, TTT\}. \]

Let $A = \{HHH\}$ be the event that all three tosses are heads. Since all the outcomes are equally likely, we know that \[ P(A) = \frac{1}{8}. \]

Now let's see what happens to the probabilities when you are given some information.

Example 2 (3 coin tosses)

Suppose an oracle told you that the first coin flip is going to be heads. How does this change the probability of getting three heads?


Since we know the first flip is going to be heads, we can update our sample space to the outcomes which have heads on the first toss. Namely, we can pretend our sample space is the event that the first flip is heads, \[ B = \{HHH, HHT, HTH ,HTT\}. \] Now, since there are only four outcomes in $B$, these outcomes are all equally likely, and only one of them is $HHH$, we easily conclude that \[ P(A | B) = \frac{1}{4}. \]

Indeed, by revealing that a head has occurred on the first flip, the probability of getting all heads has increased.
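The restriction argument above is easy to check by brute force. Here is a minimal sketch in Python (not part of the original notes) that enumerates the eight outcomes and discards those inconsistent with the oracle's information.

```python
from itertools import product

# All 8 equally likely outcomes of three coin flips.
outcomes = ["".join(flips) for flips in product("HT", repeat=3)]

# Restrict to the new sample space B: outcomes whose first flip is heads.
B = [w for w in outcomes if w[0] == "H"]

# Within B, the proportion of outcomes belonging to A = {HHH}.
print(sum(w == "HHH" for w in B) / len(B))   # 0.25, i.e. P(A|B) = 1/4
```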


How do we make this more general? Here it is useful to think of $P(A)$ as the proportion of the sample space $S$ taken up by $A$, \[ P(A) = \frac{P(A)}{P(S)}, \] since $P(S) = 1$. Given that an event $B$ has occurred, we should restrict ourselves to $B$ and view $B$ as the new sample space. This means we should interpret the new probability $P(A|B)$ as the relative proportion of $B$ taken up by $A$.


A graphical illustration of conditional probability

The following definition makes this intuition more precise.

Conditional probability
Given two events $A, B \subset S$ with $P(B)\neq 0$, we define the conditional probability of $A$ given $B$ by \[\tag{1}\label{1} P(A|B) := \frac{P(A\cap B)}{P(B)}. \]
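For a finite sample space with equally likely outcomes, definition \eqref{1} reduces to counting, since the factors of $1/|S|$ cancel. The following sketch (the helper name `cond_prob` is ours, not from the notes) checks that the formula reproduces the answer of Example 2.

```python
from fractions import Fraction
from itertools import product

def cond_prob(A, B):
    """P(A|B) = P(A ∩ B)/P(B) when all outcomes are equally likely and B is nonempty."""
    return Fraction(len(A & B), len(B))

# Events from Example 2, inside the sample space of three coin flips.
S = {"".join(w) for w in product("HT", repeat=3)}
A = {"HHH"}                          # all three flips are heads
B = {w for w in S if w[0] == "H"}    # first flip is heads
print(cond_prob(A, B))               # 1/4, matching the restriction argument above
```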

Important

In general, if $P(B) = 0$, then the quantity $P(A|B)$ does not make sense, since by definition we end up with $\frac{0}{0}$.

Note

The quantity $A|B$ does not make sense as an event on its own. Instead, one should think of $P(A|B)$ in \eqref{1} as a new probability for $A$ that has been updated with the information from $B$.

Question

Can you obtain the result of Example 2 using the formula \eqref{1} for conditional probability?

Example 3

Suppose you flip a fair coin three times. What is the probability of getting exactly two heads, given that the first flip is heads?


To solve this, define the events \[ A = \{HHT, THH, HTH\}, \] \[ B= \{HHH, HHT, HTH, HTT\}. \] Our goal is to find $P(A|B)$. To this end, it is convenient to find \[ A\cap B = \{HHT, HTH\}. \] Since all outcomes are equally likely, we find that \[ P(A|B) = \frac{P(A\cap B)}{P(B)} = \frac{2/8}{4/8} = \frac{1}{2}. \]
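As a sanity check on the fractions above (this snippet is ours, not from the notes), we can compute $P(A\cap B)$ and $P(B)$ directly as proportions of the full eight-outcome sample space:

```python
from fractions import Fraction
from itertools import product

S = ["".join(w) for w in product("HT", repeat=3)]   # 8 outcomes
A = [w for w in S if w.count("H") == 2]              # exactly two heads
B = [w for w in S if w[0] == "H"]                    # first flip is heads
AB = [w for w in A if w in B]

P_AB = Fraction(len(AB), len(S))    # 2/8
P_B = Fraction(len(B), len(S))      # 4/8
print(P_AB, P_B, P_AB / P_B)        # 1/4 1/2 1/2
```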


Law of multiplication

The concept of conditional probability allows us to calculate the probability of the intersection of two events via a formula called the law of multiplication.

Law of multiplication
Let $A, B \subset S$ be two events with $P(B)\neq 0$; then \[\tag{2}\label{2} P(A\cap B) = P(A|B)P(B). \]

Note

Note that \eqref{2} is just another way to write the definition of conditional probability \eqref{1}. Moreover, the choice of $A$ and $B$ on the right-hand side is arbitrary. Indeed, due to the symmetry of $A\cap B$, we could just as well have defined the law of multiplication by \[ P(A\cap B) = P(B|A)P(A). \] This symmetry will play an important role when we discuss Bayes' rule later.

This formula is very similar to the rule of products we used for counting, but it is in fact far more general, since it doesn't require the outcomes in the sample space to be equally likely. Indeed, it is often useful to think of the sets $A$ and $B$ as specific actions which follow each other. If one first performs $B$ followed by $A$, then \eqref{2} says that we can find the probability $P(A\cap B)$ of $B$ followed by $A$ by first calculating $P(B)$ and then $P(A|B)$, the probability of $A$ given that $B$ has already occurred.

Example 4 (Draw two cards from a deck)

Suppose you draw two cards from a standard 52 card deck. What is the probability that the two cards are spades?


In order to do this, we suppose that the two cards are drawn one after another and define two events \[ A_1 = \text{ the first card is a spade}, \] \[ A_2 = \text{ the second card is a spade}. \]

We are then interested in the probability of $A_1\cap A_2$, the event that both cards are spades. Using the law of multiplication, we obtain \[ P(A_1\cap A_2) = P(A_2|A_1)P(A_1). \] Since there are 13 spades in the 52-card deck, we easily find that \[ P(A_1) = \frac{13}{52}. \] To compute the conditional probability $P(A_2|A_1)$, we note that once we know the first card is a spade, there are only 12 spades left to choose from among the remaining 51 cards. Therefore \[ P(A_2|A_1) = \frac{12}{51}. \] This gives \[ P(A_1\cap A_2) = \left(\frac{12}{51}\right)\left(\frac{13}{52}\right) = \frac{1}{17}. \]

We can compare this to the combinatorial approach via the rule of products, using the fact that all two-card combinations are equally likely, which gives

\[ P(A_1\cap A_2) = \frac{|A_1\cap A_2|}{|S|} = \frac{13\cdot 12}{52\cdot 51} = \frac{1}{17}. \]
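The answer is also easy to confirm by simulation. The following Monte Carlo sketch (our own check, with an ad hoc deck encoding) estimates the probability of drawing two spades:

```python
import random

deck = [suit for suit in "SHDC" for _ in range(13)]   # 13 cards of each suit
trials = 200_000
hits = 0
for _ in range(trials):
    first, second = random.sample(deck, 2)             # two cards drawn without replacement
    hits += (first == "S" and second == "S")
print(hits / trials, 1 / 17)                           # estimate vs exact value ~0.0588
```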

Note

The law of multiplication can be iterated to apply to multiple sets. For instance, given sets $A,B,C$ with $P(C)\neq 0$ and $P(B\cap C)\neq 0$, we can apply the multiplicative law twice to obtain

\[ \begin{aligned} P(A\cap B\cap C) &= P(A|B\cap C)P(B\cap C)\\ &= P(A|B\cap C)P(B|C)P(C). \end{aligned} \]
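For instance, extending Example 4 by drawing a third card and letting $A_3$ be the event that the third card is a spade (this extension is ours, not part of the original example), the iterated law gives \[ P(A_1\cap A_2\cap A_3) = P(A_3|A_1\cap A_2)\,P(A_2|A_1)\,P(A_1) = \left(\frac{11}{50}\right)\left(\frac{12}{51}\right)\left(\frac{13}{52}\right) = \frac{11}{850}. \]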

Law of total probability

The law of multiplication can often be applied systematically to find the probability of a certain event by breaking things down into various cases. Specifically, suppose that $S$ can be divided up into three pairwise disjoint subsets $B_1,B_2,B_3$ called a partition of $S$, \[ S = B_1 \cup B_2 \cup B_3, \quad B_i\cap B_j = \emptyset, i\neq j. \] Any event $A\subset S$ can then be broken up by this partition into disjoint pieces \[ A = (A\cap B_1)\cup(A\cap B_2)\cup(A\cap B_3), \] where \[ (A \cap B_i)\cap(A\cap B_j) = \emptyset \quad i\neq j. \] This is illustrated in the following figure


A graphical illustration of the partition of $A$ by $B_1$, $B_2$, $B_3$

Therefore using the additive property of disjoint events and the law of multiplication we can write

\[ \begin{aligned} P(A) &= P(A\cap B_1) + P(A\cap B_2) + P(A\cap B_3)\\ &= P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + P(A|B_3)P(B_3). \end{aligned} \]

The second equality above is often referred to as the law of total probability. It is often convenient to represent this law graphically as a probability tree:

The above tree is to be interpreted as follows: the top node of the tree corresponds to the entire sample space $S$; the next level down consists of 3 nodes, each corresponding to one of the sets $B_1$, $B_2$, $B_3$ in the partition of $S$; and the last level consists of nodes obtained by splitting each event $B_i$ further into the event $A$ and the event $\overline{A}$. Conditional probabilities are written along the edges (or lines) connecting any two nodes, each being the probability of the set in the lower node conditioned on the set in the higher node.

The formula

\[\label{3}\tag{3} { \color{red} P(A) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + P(A|B_3)P(B_3)} \]

is obtained by working from the bottom of the tree upwards along the red paths, multiplying probabilities on the way up, and then adding the probabilities of each branch.

Note

In the probability tree, we chose to include the event $\overline{A}$ in the last row. In general, if all one is interested in is $P(A)$, then it is not necessary to include the $\overline{A}$ branches. We could just as easily have drawn the following tree

which contains all the probabilities necessary for equation \eqref{3}. In general, the reason for including all possible outcomes (in this case $\overline{A}$) has to do with redundancy, since it gives you a way to check for mistakes in calculating probabilities by ensuring that the probabilities of the branches under a given node add up to $1$. For example, \[ P(A|B_1) + P(\overline{A}|B_1) = 1. \]

Naturally, this can be generalized to a partition of $S$ into $k$ pairwise disjoint sets $B_1, B_2,\ldots, B_k$ in the following way.

Law of total probability
Let $B_1,B_2,\ldots, B_k$ be a partition of $S$ with $P(B_i) \neq 0$ for each $1\leq i \leq k$; then for any event $A$, \[ P(A) = \sum_{j=1}^k P(A|B_j)P(B_j). \]

Example 5 (Urn problem)

Suppose an urn contains 5 red balls and 2 green balls. You reach in and draw two balls out, one by one, without returning the balls to the urn. What is the probability that the second ball is red?


This problem is actually quite simple and doesn't require the law of total probability or conditional probability since every ball is equally likely to be the second ball (why is this?). Therefore since there are 7 balls and 5 red balls, the probability that the second ball is red is just $5/7$.

However, it is instructive to see how to use the law of total probability to compute this.

Let's define some events:

\[ \begin{array}{ll} R_1 = \text{first ball is red}, & R_2 = \text{second ball is red}\\ G_1 = \text{first ball is green}, & G_2 = \text{second ball is green} \end{array} \]

Of course, we know that $P(R_1) = \frac{5}{7}$ and $P(G_1) = \frac{2}{7}$; therefore we simply need to calculate the conditional probabilities, which are easily obtained by counting: \[ P(R_2|R_1) = \frac{4}{6},\quad P(R_2|G_1) = \frac{5}{6}. \] By the law of total probability, this gives

\[ \begin{aligned} P(R_2) &= P(R_2|R_1)P(R_1) + P(R_2|G_1)P(G_1)\\ &= \left(\frac{4}{6}\right)\left(\frac{5}{7}\right) + \left(\frac{5}{6}\right)\left(\frac{2}{7}\right)\\ &= \frac{5}{7}. \end{aligned} \]

This can be visualized using the following tree:

where we have highlighted the relevant branches in red.
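The computation above is just a weighted sum over the branches of the tree, which can be evaluated exactly with fractions. The helper name `total_probability` below is ours, not from the notes.

```python
from fractions import Fraction

def total_probability(branches):
    """Sum of P(A|B_j) * P(B_j) over the branches of a partition B_1, ..., B_k."""
    return sum(p_A_given_B * p_B for p_A_given_B, p_B in branches)

# Branches for Example 5: (P(R2|R1), P(R1)) and (P(R2|G1), P(G1)).
branches = [(Fraction(4, 6), Fraction(5, 7)),
            (Fraction(5, 6), Fraction(2, 7))]
print(total_probability(branches))   # 5/7
```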


Probability Urn

An example of the type above is often referred to as a probability urn model (Pólya's urn). It is a convenient toy model that serves as a general framework for many exercises in probability and can unify many probabilistic models (see Urn problem or Pólya urn model). For instance, we can model a fair coin flip by drawing from an urn with equally many red and green balls.

As mentioned in the above example, one does not really need to use the law of total probability to show that $P(R_2) = \frac{5}{7}$. Let's now consider a more complicated setting where we really do need the law of total probability.

Example 6 (Modified urn)

Consider the same setup as Example 5. Suppose that after the first ball is drawn, if it's green, a red ball is added to the urn, and if it's red, a green ball is added to the urn (the first ball you drew is kept out of the urn). Then the second ball is drawn. What is the probability that the second ball is red?


In this case the probabilities of the first and the second ball being red are no longer equal, since we are adding balls depending on the outcome of the first draw. Here the law of total probability becomes a very valuable tool for keeping track of the various dependencies. An easy calculation shows that the probability tree is now given by:

Therefore we have

\[ P(R_2) = \left(\frac{4}{7}\right)\left(\frac{5}{7}\right) + \left(\frac{6}{7}\right)\left(\frac{2}{7}\right) = \frac{32}{49}. \]
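Because the composition of the urn now depends on the first draw, a direct simulation is a convenient sanity check. This sketch (our own, with an ad hoc ball encoding) estimates $P(R_2)$ for the modified urn:

```python
import random

trials = 200_000
hits = 0
for _ in range(trials):
    urn = ["R"] * 5 + ["G"] * 2
    first = urn.pop(random.randrange(len(urn)))   # first ball is drawn and kept out
    urn.append("G" if first == "R" else "R")      # add a ball of the opposite color
    hits += (random.choice(urn) == "R")           # second draw
print(hits / trials, 32 / 49)                     # estimate vs exact value ~0.653
```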

Independence

It is often the case that knowledge of one event has no effect on the probability of another event. This is known in probability as independence.

Independence
Two events are independent if knowledge of one event doesn't affect the likelihood of the other.

Examples
  • Outcomes of successive coin flips are generally treated as independent events, since the outcome of one flip should not affect the outcome of the second flip.

  • The weather at two significantly different times could be considered independent. For instance, knowing that it rained today generally doesn't have any bearing on the probability that it will rain a month from today. This is a hallmark property of chaos.

Note

The concept of independence is an approximate property. It is used for its mathematical and conceptual convenience. In practice, it is unlikely that two events are exactly independent, since there are often many convoluted factors that can lead one event to depend on the other. However, for many events, like successive coin flips, the dependence between the two events is so weak and so convoluted that for all intents and purposes they are independent.


We can make the idea of independence more precise using conditional probability. If $P(A) \neq 0$ and $P(B) \neq 0$, then we would expect $A$ and $B$ to be independent if conditioning on either of the events doesn't change the probability of the other event, namely

\[\label{4}\tag{4} P(A|B) = P(A) \quad \text{and}\quad P(B|A) = P(B). \]

In general, using the definition of conditional probability \eqref{1}, we can rewrite \eqref{4} in a more symmetric way that doesn't require $P(A)$ or $P(B)$ to be nonzero:

Independence (formal definition)
Given two events $A, B \subset S$, we say $A$ and $B$ are independent if \[\tag{5}\label{5} P(A\cap B) = P(A)P(B). \]
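As a concrete check of \eqref{5} (this small example is ours, not from the notes), consider two fair coin flips and the events "first flip is heads" and "second flip is heads":

```python
from fractions import Fraction
from itertools import product

S = ["".join(w) for w in product("HT", repeat=2)]   # HH, HT, TH, TT
A = {w for w in S if w[0] == "H"}                   # first flip is heads
B = {w for w in S if w[1] == "H"}                   # second flip is heads

def P(E):
    return Fraction(len(E), len(S))

print(P(A & B), P(A) * P(B))   # both equal 1/4, so A and B are independent
```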

Note

Using the definition of conditional probability it is easy to see that

  • if $P(B) \neq 0$, then \[ P(A\cap B) = P(A)P(B)\quad \Leftrightarrow \quad P(A|B) = P(A) \]
  • if $P(A)\neq 0$, then \[ P(A\cap B) = P(A)P(B)\quad \Leftrightarrow\quad P(B|A) = P(B). \]

Therefore \eqref{4} is equivalent to \eqref{5} as long as $P(A), P(B)\neq 0$.

Question

Can two mutually exclusive events be independent?

Paradoxes of independence

The definition \eqref{5} can give rise to some seemingly paradoxical properties of independent sets. For instance, if $A$ satisfies $P(A) = 0$, then \[ P(A\cap A) = 0 = P(A)P(A), \] and so $A$ is independent of itself, even though knowledge of $A$ clearly affects the outcome of $A$. This paradox is resolved by realizing that independence is a statement about probabilities rather than about causal influence: an event of probability $0$ (or, similarly, of probability $1$) satisfies \eqref{5} with every event, including itself, simply because both sides of the equation are forced to agree.

Example 7

Suppose you roll a fair six-sided die. Define the following events based on the outcome of the roll:

\[ \begin{aligned} &A = \text{roll an even number}\\ &B = \text{roll an odd number}\\ &C = \text{roll a 1 or 2} \end{aligned} \]

Are $A$ and $B$ independent? How about $B$ and $C$?


Let's use definition \eqref{5}. Here we easily find that

\[ P(A) = \frac{1}{2}, \quad P(B) = \frac{1}{2}, \quad P(A\cap B) = 0. \]

Therefore, since \[ 0 = P(A\cap B) \neq P(A)P(B) = \frac{1}{4}, \] we conclude that $A$ and $B$ are not independent.

For $B$ and $C$, we find

\[ P(B) = \frac{1}{2}, \quad P(C) = \frac{1}{3}, \] and therefore $P(B)P(C) = \frac{1}{6}$. Since \[ P(B\cap C) = P(\text{roll a 1}) = \frac{1}{6}, \] we conclude that $B$ and $C$ are independent.
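The same check can be carried out by direct enumeration of the six die outcomes (this snippet is ours, not from the notes):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}        # fair six-sided die
A = {2, 4, 6}                 # roll an even number
B = {1, 3, 5}                 # roll an odd number
C = {1, 2}                    # roll a 1 or 2

def P(E):
    return Fraction(len(E), len(S))

print(P(A & B), P(A) * P(B))  # 0 vs 1/4: A and B are not independent
print(P(B & C), P(B) * P(C))  # 1/6 vs 1/6: B and C are independent
```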

Note

Independence is a property of the probability measure and not just of the events in question. In general, one cannot decide whether two events are independent without knowing the probability model used. Indeed, any two events with a non-empty overlap can be made independent with the right assignment of probabilities. The next example illustrates this.

Example 8

Consider the following Venn diagram for sets $A$ and $B$.

The probabilities of various non-overlapping regions are labeled, namely

\[ P(A\backslash B) = 0.4, \quad P(A\cap B) = 0.1, \quad P(B\backslash A) = 0.1. \]

Are $A$ and $B$ independent?


Here we can find $P(A)$ and $P(B)$ by

\[ P(A) = 0.4 + 0.1 = 0.5 \] and \[ P(B) = 0.1 + 0.1 = 0.2. \] Therefore since $P(A)P(B) = (.5)(.2) = .1 = P(A\cap B)$, we conclude that $A$ and $B$ are independent.