JM Ascacibar
July 23, 2024
Messing with Intuition: A Review of Basic Probability Concepts

You might think it’s a total waste of time to revisit probability theory and its laws. The problem that follows was presented by Amos Tversky and Daniel Kahneman to illustrate how people violate the principles of probability when making intuitive judgments.
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?
A probability is a fraction of a finite set.
survey = 1000
bank_tellers = 20
prob_bank_teller = bank_tellers / survey
print(f'The prob of being a bank teller is: {prob_bank_teller:.4f}')
The prob of being a bank teller is: 0.0200
If we choose one person from the survey at random (every person has the same chance of being chosen), the probability that they are a bank teller is 2%.
# Download the file with the data
from os.path import basename, exists

def download(url):
    filename = basename(url)  # extract the file name from the URL
    if not exists(filename):  # download only if the file is not in the local directory
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)  # download the file from the URL
        print('Downloaded ' + local)

download('https://github.com/AllenDowney/ThinkBayes2/raw/master/data/gss_bayes.csv')

| | caseid | year | age | sex | polviews | partyid | indus10 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1974 | 21.0 | 1 | 4.0 | 2.0 | 4970.0 |
| 1 | 2 | 1974 | 41.0 | 1 | 5.0 | 0.0 | 9160.0 |
| 2 | 5 | 1974 | 58.0 | 2 | 6.0 | 1.0 | 2670.0 |
| 3 | 6 | 1974 | 30.0 | 1 | 5.0 | 4.0 | 6870.0 |
| 4 | 7 | 1974 | 48.0 | 1 | 5.0 | 4.0 | 7860.0 |
The columns are:
caseid: Respondent identifier.
year: Year when the respondent was surveyed.
age: Respondent’s age when surveyed.
sex: Male or female.
polviews: Political views on a range from liberal to conservative.
partyid: Political party affiliation, Democrat, Independent, or Republican.
indus10: Code for the industry the respondent works in.
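The code that follows refers to a DataFrame named `gss`, but the loading step is not shown. A minimal sketch of it, assuming pandas and using only the five rows from the table above typed into an in-memory CSV (the name `gss_sample` is hypothetical; with the downloaded file the call would presumably be `pd.read_csv('gss_bayes.csv')`):

```python
import io
import pandas as pd

# First five rows of the table above, typed in by hand so the sketch runs offline.
sample_csv = io.StringIO(
    "caseid,year,age,sex,polviews,partyid,indus10\n"
    "1,1974,21.0,1,4.0,2.0,4970.0\n"
    "2,1974,41.0,1,5.0,0.0,9160.0\n"
    "5,1974,58.0,2,6.0,1.0,2670.0\n"
    "6,1974,30.0,1,5.0,4.0,6870.0\n"
    "7,1974,48.0,1,5.0,4.0,7860.0\n"
)

# read_csv adds a default integer index, which is the unnamed first column of the table.
gss_sample = pd.read_csv(sample_csv)
print(gss_sample.head())
```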
import matplotlib.pyplot as plt
fig, axs = plt.subplots(3, 2, figsize=(10,15))
# Year
axs[0,0].hist(gss.year, bins=30, color='blue', edgecolor='black')
axs[0,0].set_title('Histogram of Year')
axs[0,0].set_xlabel('Years')
axs[0,0].set_ylabel('Frequency')
# Age
axs[0,1].hist(gss.age, bins=25, color='blue', edgecolor='black')
axs[0,1].set_title('Histogram of Age')
axs[0,1].set_xlabel('Age')
axs[0,1].set_ylabel('Frequency')
# sex
sex_valcount = gss.sex.value_counts(normalize=True)
axs[1,0].barh(sex_valcount.index, sex_valcount.values, edgecolor='black')
axs[1,0].set_yticks(sex_valcount.index)
axs[1,0].set_title('Bar chart Sex')
axs[1,0].set_ylabel('Sex')
axs[1,0].set_xlabel('Percentage')
# polviews
pol_valcount = gss.polviews.value_counts(normalize=True)
axs[1,1].barh(pol_valcount.index, pol_valcount.values, edgecolor='black')
axs[1,1].set_yticks(pol_valcount.index)
axs[1,1].set_title('Bar plot political views')
axs[1,1].set_ylabel('Political views')
axs[1,1].set_xlabel('Percentage')
# partyid
party_valcount = gss.partyid.value_counts(normalize=True)
axs[2,0].barh(party_valcount.index, party_valcount.values, edgecolor='black')
axs[2,0].set_yticks(party_valcount.index)
axs[2,0].set_ylabel('Party affiliation')
axs[2,0].set_xlabel('Percentage')
# indus10
axs[2,1].hist(gss.indus10, bins=gss.indus10.nunique())
axs[2,1].set_title('Hist of industry codes')
axs[2,1].set_ylabel('Frequency')
axs[2,1].set_xlabel('Industry Code')
plt.tight_layout()
plt.show()
The code for “Banking and related activities” is 6870
| | caseid | year | age | sex | polviews | partyid | indus10 |
|---|---|---|---|---|---|---|---|
| 3 | 6 | 1974 | 30.0 | 1 | 5.0 | 4.0 | 6870.0 |
| 33 | 44 | 1974 | 54.0 | 2 | 4.0 | 1.0 | 6870.0 |
| 45 | 56 | 1974 | 59.0 | 1 | 5.0 | 0.0 | 6870.0 |
| 91 | 118 | 1974 | 28.0 | 2 | 4.0 | 1.0 | 6870.0 |
| 106 | 135 | 1974 | 30.0 | 2 | 4.0 | 2.0 | 6870.0 |
print(f"There are {banker.sum()} bankers in the dataset, which is {banker.mean()*100:.2f}% of the data")
There are 728 bankers in the dataset, which is 1.48% of the data
Let’s create a probability function that takes a Boolean Series and returns the probability as a fraction of a finite set: the fraction of its values that are True.
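The function itself is not shown here. One plausible sketch, assuming pandas, relies on the fact that the mean of a Boolean Series is the fraction of `True` values. The industry codes below are hypothetical, not the real GSS data:

```python
import pandas as pd

def prob(A):
    """Probability of a Boolean Series: the fraction of its values that are True."""
    return A.mean()

# Hypothetical industry codes for five respondents (not the real GSS data).
indus10 = pd.Series([4970, 9160, 2670, 6870, 6870])

# Boolean Series: True where the respondent works in banking (code 6870).
banker = indus10 == 6870

print(banker.sum())  # number of bankers in the toy data
print(prob(banker))  # fraction of bankers in the toy data
```

Because `True` counts as 1 and `False` as 0, `sum()` counts the matching rows and `mean()` divides that count by the total number of rows.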
The encoding of polviews is:
1. Extremely liberal
2. Liberal
3. Slightly liberal
4. Moderate
5. Slightly conservative
6. Conservative
7. Extremely conservative
The encoding of partyid is:
0. Strong democrat
1. Not strong democrat
2. Independent, near democrat
3. Independent
4. Independent, near republican
5. Not strong republican
6. Strong republican
7. Other party
map_polviews = {
1:"Extremely liberal",
2:"Liberal",
3:"Slightly liberal",
4:"Moderate",
5:"Slightly conservative",
6:"Conservative",
7:"Extremely conservative"
}
unique_cat = gss.polviews.nunique()
for i in range(1, unique_cat+1):
    mask = gss.polviews == i
    print(f"The prob of being '{map_polviews[i]}' is {prob(mask)*100:.3f}%")
The prob of being 'Extremely liberal' is 2.926%
The prob of being 'Liberal' is 11.783%
The prob of being 'Slightly liberal' is 12.666%
The prob of being 'Moderate' is 38.432%
The prob of being 'Slightly conservative' is 16.109%
The prob of being 'Conservative' is 14.849%
The prob of being 'Extremely conservative' is 3.236%
map_partyid = {
0:"Strong democrat",
1:"Not strong democrat",
2:"Independent, near democrat",
3:"Independent",
4:"Independent, near republican",
5:"Not strong republican",
6:"Strong republican",
7:"Other party"
}
for i in range(len(map_partyid)):
    mask = gss.partyid == i
    print(f"The prob of being a '{map_partyid[i]}' is {prob(mask)*100:.3f}")
The prob of being a 'Strong democrat' is 15.995
The prob of being a 'Not strong democrat' is 20.631
The prob of being a 'Independent, near democrat' is 12.532
The prob of being a 'Independent' is 14.058
The prob of being a 'Independent, near republican' is 9.255
The prob of being a 'Not strong republican' is 16.101
The prob of being a 'Strong republican' is 10.000
The prob of being a 'Other party' is 1.428
Conjunction is another name for the logical operation and: the conjunction of two events is true only when both occur. Its probability is also called the joint probability, and it satisfies the product rule: \(P(A\cap B) = P(A|B) P(B)\).
Quite low!
0.0018259281801582471
Higher if you are strong Republican 😃
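The conjunction described above can be sketched with the `&` operator on Boolean Series. This is an illustration on hypothetical data (the `toy` DataFrame is made up, and `prob` is assumed to be the mean of a Boolean Series), so the numbers do not match the GSS results:

```python
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean Series."""
    return A.mean()

# Hypothetical respondents (not the real GSS data).
toy = pd.DataFrame({
    "indus10": [6870, 6870, 4970, 6870, 9160, 2670, 6870, 7860],
    "partyid": [0, 6, 0, 1, 6, 3, 6, 0],
})

banker = toy["indus10"] == 6870
strong_democrat = toy["partyid"] == 0
strong_republican = toy["partyid"] == 6

# `&` is True only in rows where both propositions are True.
p_banker_dem = prob(banker & strong_democrat)
p_banker_rep = prob(banker & strong_republican)
print(p_banker_dem, p_banker_rep)

# Conjunction is commutative: the order of the operands does not matter.
print(prob(strong_democrat & banker) == p_banker_dem)
```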
Conditional probability is a probability that depends on a condition (conditional probability distribution).
\[P(A|B) = P(A\cap B) / P(B)\]
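A conditional probability can be computed by first keeping only the rows where the condition holds and then taking the fraction. The sketch below is one possible implementation, not necessarily the one used for the results in this post. The `toy` data is hypothetical, and the coding 1 = male, 2 = female is an assumption:

```python
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean Series."""
    return A.mean()

def conditional(proposition, given):
    """P(proposition | given): restrict to rows where `given` is True, then take the fraction."""
    return prob(proposition[given])

# Hypothetical respondents (not the real GSS data).
toy = pd.DataFrame({
    "sex": [2, 2, 1, 2, 1, 1, 2, 2],  # assumed coding: 1 = male, 2 = female
    "indus10": [6870, 6870, 6870, 4970, 9160, 2670, 7860, 6870],
})

female = toy["sex"] == 2
banker = toy["indus10"] == 6870

print(conditional(female, given=banker))  # 0.75: fraction of bankers who are female
print(conditional(banker, given=female))  # 0.6: fraction of females who are bankers
```

Swapping the arguments changes the denominator (bankers versus females), which is why the two results differ.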
What is the probability that a respondent is a Democrat, given that they are liberal?
A bit more than 50% of liberals are Democrats.
What is the probability that a respondent is female, given that they are a banker?
77% of the bankers are female.
Conjunction is commutative (A & B == B & A), but conditional probability is not.
conditional(A,B) != conditional(B,A)
(0.7706043956043956, 0.02116102749801969)
77% of the bankers are female, but only about 2% of female respondents are bankers.
(0.22939560439560439, 0.007331313929496466)
23% of the bankers are male, and about 0.7% of male respondents are bankers.
We can also combine conditional probability with conjunction.
What is the probability a respondent is female, given that they are a liberal and Democrat?
Just over 57% of liberal Democrats are female.
We can derive three relationships between conjunction and conditional probability. The first, theorem 1, expresses conditional probability in terms of conjunction:
\[P(A|B) = \frac{P(A~\mathrm{and}~B)}{P(B)}\]
For example, if we want to know the fraction of bankers who are female, \(P(\text{female} \mid \text{banker})\), we can use the conditional probability function as we did before.
The operation, as we’ve seen before, is:
We could also compute the probability of being both female and a banker, and divide it by the probability of being a banker.
The result is the same.
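The two routes can be compared side by side. The sketch below uses hypothetical data, assumed definitions of `prob` and `conditional`, and an assumed coding of 1 = male, 2 = female, so only the equality of the two results carries over to the real dataset, not the value itself:

```python
import math
import pandas as pd

def prob(A):
    return A.mean()

def conditional(proposition, given):
    return prob(proposition[given])

# Hypothetical respondents (not the real GSS data).
toy = pd.DataFrame({
    "sex": [2, 2, 1, 2, 1, 1, 2, 2],  # assumed coding: 1 = male, 2 = female
    "indus10": [6870, 6870, 6870, 4970, 9160, 2670, 7860, 6870],
})
female = toy["sex"] == 2
banker = toy["indus10"] == 6870

# Route 1: filter to bankers, then take the fraction who are female.
direct = conditional(female, given=banker)

# Route 2: theorem 1, P(female and banker) / P(banker).
via_conjunction = prob(female & banker) / prob(banker)

print(direct, via_conjunction)
print(math.isclose(direct, via_conjunction))
```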
From theorem 1, we can compute a conjunction as the product of two probabilities (theorem 2):
\[P(A~\mathrm{and}~B) = P(B) P(A|B)\]
As we’ve already seen, conjunction is commutative (conditional probability is not), so \(P(A~\mathrm{and}~B) = P(B~\mathrm{and}~A)\).
If we apply theorem 2 to both sides: \[P(A~\mathrm{and}~B) = P(B) P(A|B)\]
\[P(B~\mathrm{and}~A) = P(A) P(B|A)\]
\[P(B) P(A|B) = P(A) P(B|A)\]
So if we divide both sides by \(P(B)\), we get theorem 3, which is Bayes’s theorem: \[P(A|B) = \frac{P(A)P(B|A)}{P(B)}\]
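Theorems 2 and 3 can be checked numerically. In this sketch the data is hypothetical, and the definitions of `liberal` (polviews of 3 or less) and `democrat` (partyid of 1 or less) are assumptions based on the encodings listed earlier:

```python
import math
import pandas as pd

def prob(A):
    return A.mean()

def conditional(proposition, given):
    return prob(proposition[given])

# Hypothetical respondents (not the real GSS data).
toy = pd.DataFrame({
    "polviews": [1, 2, 3, 4, 5, 6, 2, 3, 7, 4],
    "partyid":  [0, 1, 1, 0, 5, 6, 0, 2, 6, 1],
})
liberal = toy["polviews"] <= 3   # assumption: codes 1 to 3 count as liberal
democrat = toy["partyid"] <= 1   # assumption: codes 0 and 1 count as Democrat

# Theorem 2: P(A and B) = P(B) P(A|B)
lhs2 = prob(liberal & democrat)
rhs2 = prob(democrat) * conditional(liberal, given=democrat)

# Theorem 3 (Bayes's theorem): P(A|B) = P(A) P(B|A) / P(B)
lhs3 = conditional(liberal, given=democrat)
rhs3 = prob(liberal) * conditional(democrat, given=liberal) / prob(democrat)

print(math.isclose(lhs2, rhs2), math.isclose(lhs3, rhs3))
```

The comparisons use `math.isclose` rather than `==` because the two sides are computed through different floating-point operations.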
In addition to the three theorems, there’s also the law of total probability:
The total probability of A is the sum of two probabilities: \[P(A) = P(B_1~\mathrm{and}~A) + P(B_2~\mathrm{and}~A)\]
This holds only if \(B_1\) and \(B_2\) are:
1. Mutually exclusive (they can’t occur at the same time)
2. Collectively exhaustive (at least one of the events must occur)
Because male and female are mutually exclusive and collectively exhaustive (MECE), we get the same result we got by computing the probability of banker directly.
Applying theorem 2 to the law of total probability: \[P(A) = P(B_1)P(A|B_1) + P(B_2)P(A|B_2)\]
0.014769730168391153
When there are more than two conditions, we can write the law of total probability as: \[P(A) = \sum_i P(B_i) P(A|B_i)\] Remember that this holds only as long as the \(B_i\) are mutually exclusive and collectively exhaustive.
polviews
1.0 0.029255
2.0 0.117833
3.0 0.126659
4.0 0.384317
5.0 0.161087
6.0 0.148489
7.0 0.032360
Name: proportion, dtype: float64
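The sum over conditions can be sketched as follows, using the values of `polviews` as the conditions \(B_i\). The data is hypothetical and `prob` and `conditional` are assumed definitions. The point is only that the sum reproduces the probability computed directly:

```python
import math
import pandas as pd

def prob(A):
    return A.mean()

def conditional(proposition, given):
    return prob(proposition[given])

# Hypothetical respondents (not the real GSS data).
toy = pd.DataFrame({
    "polviews": [1, 2, 2, 4, 4, 4, 6, 7],
    "indus10":  [6870, 4970, 6870, 6870, 9160, 2670, 6870, 7860],
})
banker = toy["indus10"] == 6870
B = toy["polviews"]

# Each respondent has exactly one polviews value, so the events B == i are
# mutually exclusive and collectively exhaustive. Iterating only over values
# that occur avoids taking the mean of an empty selection.
total = sum(prob(B == i) * conditional(banker, given=(B == i))
            for i in sorted(B.unique()))

print(total, prob(banker))
```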
What is the probability that a respondent is liberal, given that they are a Democrat?
What is the probability that a respondent is a Democrat, given that they are liberal?
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?
1. Linda is a banker.
2. Linda is a banker and considers herself a liberal Democrat.
To answer this question, compute:
The probability that Linda is a banker, given that she is female.
The probability that Linda is a banker and a liberal Democrat, given that she is female.
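One way to set up these two computations is sketched below on hypothetical data. The definitions of `female`, `liberal`, and `democrat` are assumptions, and the numbers are not the answer for the real dataset. The inequality holds for any data, because the rows that satisfy the conjunction are a subset of the rows that satisfy `banker` alone:

```python
import pandas as pd

def prob(A):
    return A.mean()

def conditional(proposition, given):
    return prob(proposition[given])

# Hypothetical respondents (not the real GSS data).
toy = pd.DataFrame({
    "sex":      [2, 2, 2, 2, 2, 1, 1, 2],  # assumed coding: 1 = male, 2 = female
    "indus10":  [6870, 6870, 4970, 6870, 9160, 6870, 2670, 7860],
    "polviews": [2, 5, 1, 4, 3, 2, 6, 3],
    "partyid":  [0, 5, 1, 3, 0, 1, 6, 2],
})
female = toy["sex"] == 2
banker = toy["indus10"] == 6870
liberal = toy["polviews"] <= 3   # assumption: codes 1 to 3 count as liberal
democrat = toy["partyid"] <= 1   # assumption: codes 0 and 1 count as Democrat

p_banker = conditional(banker, given=female)
p_banker_lib_dem = conditional(banker & liberal & democrat, given=female)

print(p_banker, p_banker_lib_dem)
print(p_banker_lib_dem <= p_banker)
```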
There’s a famous quote about young people, old people, liberals, and conservatives that goes something like:
If you are not a liberal at 25, you have no heart. If you are not a conservative at 35, you have no brain.
Whether you agree with this proposition or not, it suggests some probabilities we can compute as an exercise. Rather than use the specific ages 25 and 35, let’s define young and old as under 30 or over 65:
(0.19435991073240008, 0.17328058429701765)
What is the probability that a randomly chosen respondent is a young liberal?
What is the probability that a young person is liberal?
What fraction of respondents are old conservatives?
What fraction of conservatives are old?
At this point, if you encounter a problem like Linda’s, you know that the probability of two events occurring together cannot exceed the probability of either event occurring alone.
In order to establish a strong foundation, we reviewed the concepts of probability, conjunction, conditional probability, and the fundamental laws that underpin Bayes’s theorem.
It’s also a good practical example of how powerful Pandas is for filtering data.
I guess you are ready for this:
Bill is 34 years old. He is intelligent, but unimaginative, compulsive and generally lifeless. In school, he was strong in mathematics but weak in social studies and humanities.
Which statement is more probable:
Most of the content of the notebook is based on the chapter about probability of the book “Think Bayes 2” written by Allen Downey. You can access the book here.
The Bill problem has been taken from this site.