Random Variable and Distribution

Yi-Ju Tseng @NYCU

Import Packages

Import packages/libraries

We will use the following packages in Python.

statistics
pandas
scipy
seaborn
numpy
matplotlib

import statistics as st
import pandas as pd
import seaborn as sns
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

Normal Distribution - Continuous Random Variables

Probability (area under the pdf)

The statistics module (imported as st) provides:

NormalDist(mu=μ, sigma=σ).cdf(x)

st.NormalDist(mu=0, sigma=1).cdf(1.4)

0.9192433407662289

st.NormalDist().cdf(1.4)

0.9192433407662289
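As a cross-check on what cdf returns, the standard normal CDF can be written with the error function, Φ(z) = ½ · (1 + erf(z / √2)). The sketch below compares that formula with NormalDist; the helper name phi is made up for this illustration.

```python
import math
import statistics as st


def phi(z: float) -> float:
    """Standard normal CDF expressed through the error function (hypothetical helper)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


p_formula = phi(1.4)                                # area to the left of z = 1.4
p_library = st.NormalDist(mu=0, sigma=1).cdf(1.4)   # same area from the statistics module
print(p_formula, p_library)
```

Both numbers should agree with the output shown above up to floating-point rounding.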

Probability (area under the pdf)

st.NormalDist(mu=0.2508, sigma=0.0005).cdf(0.2485)

2.1124547024964357e-06

st.NormalDist(mu=0.2508, sigma=0.0005).cdf(0.2515)

0.9192433407662224
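The two results above are one-sided areas, P(X < 0.2485) and P(X < 0.2515). If the quantity of interest is the probability of falling between the two limits (an assumption here, since the slide does not state the question), it can be sketched as the difference of the two cdf values:

```python
import statistics as st

dist = st.NormalDist(mu=0.2508, sigma=0.0005)
lower, upper = 0.2485, 0.2515  # the two limits used in the calls above

# P(lower < X < upper) = F(upper) - F(lower)
p_between = dist.cdf(upper) - dist.cdf(lower)
print(p_between)
```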

Quiz 3-3

You work in Quality Control for GE.

Light bulb life has a normal distribution with μ = 2000 hours and σ = 200 hours.

What’s the probability that a bulb will last

A. between 2000 and 2400 hours?

st.NormalDist(mu=2000, sigma=200).cdf(2400)\
  -st.NormalDist(mu=2000, sigma=200).cdf(2000)

0.4772498680518208
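The same answer can be obtained by standardizing first, z = (x − μ) / σ, and then using the standard normal distribution. This is a minimal sketch; z_score is a helper name introduced only for this illustration.

```python
import statistics as st

mu, sigma = 2000, 200  # bulb life parameters from the quiz


def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies above the mean (hypothetical helper)."""
    return (x - mu) / sigma


std_normal = st.NormalDist()  # mu=0, sigma=1 by default
z_low = z_score(2000, mu, sigma)
z_high = z_score(2400, mu, sigma)
p_standardized = std_normal.cdf(z_high) - std_normal.cdf(z_low)
print(z_low, z_high, p_standardized)
```

Here 2000 hours maps to z = 0 and 2400 hours to z = 2, so the result is the area between z = 0 and z = 2.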

B. less than 1470 hours?

st.NormalDist(mu=2000, sigma=200).cdf(1470)

0.004024588542758334

Finding z-Values for Known Probabilities

The statistics module (imported as st) provides:

NormalDist(mu=μ, sigma=σ).inv_cdf(p), where p is the cumulative probability P(X < x)

st.NormalDist().inv_cdf(0.1217+0.5)

0.30994865777600955
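Because inv_cdf is the inverse of cdf, feeding the z-value back into cdf should return the original probability. A small round-trip check, using the same probability as above:

```python
import statistics as st

std_normal = st.NormalDist()
p = 0.1217 + 0.5               # cumulative probability to the left of z
z = std_normal.inv_cdf(p)      # z-value with area p to its left
p_back = std_normal.cdf(z)     # should recover p up to rounding
print(z, p_back)
```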

Quiz 3-4

For a particular generation of the tomato plant, the amount x of miraculin produced had a mean of 105.3 and a standard deviation of 8.0. Assume that x is normally distributed.

Find P(100 < x < 110)

st.NormalDist(mu=105.3, sigma=8).cdf(110)-\
st.NormalDist(mu=105.3, sigma=8).cdf(100)

0.4677406073541885

Find the value a for which P(x < a) = 0.25

st.NormalDist(mu=105.3, sigma=8).inv_cdf(0.25)

99.90408199843134
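An equivalent route is to find the standard normal z-value first and then rescale it with a = μ + z·σ. The sketch below follows that route and should reproduce the value above:

```python
import statistics as st

mu, sigma = 105.3, 8.0

z = st.NormalDist().inv_cdf(0.25)  # z-value with 25% of the area to its left (negative)
a = mu + z * sigma                 # convert back to the original scale
print(z, a)
```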

Is Normal Distribution a Reasonable Model?

Normal Probability Plots - load data

Load the data with pandas: pd.read_csv("file path + name") reads a CSV file into a DataFrame.

data = pd.read_csv("baseball_data.csv")
data

	name	handedness	height	weight	bavg	HR
0	Jose Cardenal	Right	70	150	0.275	138
1	Darrell Evans	Left	74	200	0.248	414
2	Buck Martinez	Right	70	190	0.225	58
3	John Wockenfuss	Right	72	190	0.262	86
4	Tommy McCraw	Left	72	183	0.246	75
...	...	...	...	...	...	...
300	Bob Watson	Right	72	201	0.295	184
301	Ken Harrelson	Right	74	190	0.239	131
302	Ed Charles	Right	70	170	0.263	86
303	Tony Conigliaro	Right	75	185	0.264	166
304	Phil Garner	Right	70	175	0.260	109

305 rows × 6 columns

Normal Probability Plots - Histogram

With the seaborn package:

histplot(data=your data frame, x=column for the x axis)

sns.histplot(data=data,x='weight')

<Axes: xlabel='weight', ylabel='Count'>

Normal Probability Plots - PP

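With stats from the scipy package, probplot(data, plot=plt) draws a normal probability plot: the ordered sample values are plotted against the theoretical quantiles of a normal distribution. Points that fall close to a straight line suggest the data are approximately normal.

stats.probplot(data['weight'], plot=plt)
plt.show()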

Normal Probability Plots - Histogram

numpy's random.normal function generates normally distributed data.

standard_norm = np.random.normal(size=3000)
sns.histplot(standard_norm)

<Axes: ylabel='Count'>

Normal Probability Plots - PP

With stats from the scipy package, probplot(data, plot=plt) draws a normal probability plot.

stats.probplot(standard_norm, plot=plt)
plt.show()

Normal Probability Plots - Histogram

With stats from the scipy package, skewnorm.rvs() generates skewed data.

skewed_norm = stats.skewnorm.rvs(a=10, size=3000)
sns.histplot(skewed_norm)

<Axes: ylabel='Count'>

Normal Probability Plots - PP

stats.probplot(skewed_norm, plot=plt)
plt.show()
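Besides drawing the plot, probplot also returns the fitted straight line and its correlation coefficient r, which gives a rough numeric summary of how straight the points are. The sketch below compares a normal and a skewed sample; the seed and variable names are chosen only for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, used only to make the sketch reproducible
normal_sample = rng.normal(size=3000)
skewed_sample = stats.skewnorm.rvs(a=10, size=3000, random_state=rng)

# Without plot=..., probplot only computes the points and the least-squares line:
# ((theoretical quantiles, ordered values), (slope, intercept, r))
(_, _), (_, _, r_normal) = stats.probplot(normal_sample)
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample)
print(r_normal, r_skewed)
```

An r very close to 1 corresponds to points hugging the line; the skewed sample is expected to give a lower r than the normal one.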

Statistical Tests for Normal Distribution

With stats from the scipy package, the following tests can be used:

Shapiro-Wilk Test
- The p-value may not be accurate for N > 5000
- If the p-value > .05, normality is not rejected, and the data are treated as normally distributed.
Kolmogorov-Smirnov Test
- If the p-value > .05, normality is not rejected, and the data are treated as normally distributed.
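The decision rule above can be sketched as a small helper. The function name and the returned wording are made up for this illustration, and the two p-values are the Shapiro-Wilk results reported later in these slides.

```python
def normality_decision(p_value: float, alpha: float = 0.05) -> str:
    """Apply the p-value > alpha rule (hypothetical helper, alpha = .05 as on the slide)."""
    if p_value > alpha:
        return "normality not rejected"
    return "normality rejected"


decision_normal = normality_decision(0.25002941489219666)    # standard_norm sample
decision_skewed = normality_decision(7.104239699148655e-33)  # skewed_norm sample
print(decision_normal, "|", decision_skewed)
```

A large p-value does not prove normality; it only means the test found no evidence against it.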

Statistical Tests for Normal Distribution

With scipy package’s stats, the following functions can be used:

Shapiro-Wilk Test shapiro(data)
Kolmogorov-Smirnov Test kstest(data, distribution, args=(mean,sd))

stats.shapiro(standard_norm)

ShapiroResult(statistic=0.9992418885231018, pvalue=0.25002941489219666)

stats.shapiro(skewed_norm)

ShapiroResult(statistic=0.941602349281311, pvalue=7.104239699148655e-33)

stats.kstest(standard_norm, 'norm',
             args=(standard_norm.mean(), standard_norm.std()))

KstestResult(statistic=0.014450757533408576, pvalue=0.5530800639002007, statistic_location=-0.3566578832989416, statistic_sign=1)

stats.kstest(skewed_norm, 'norm',
             args=(skewed_norm.mean(), skewed_norm.std()))

KstestResult(statistic=0.07960050707173066, pvalue=5.559960961703108e-17, statistic_location=0.5990846245423823, statistic_sign=1)
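Passing args=(mean, sd) is equivalent to standardizing the data and testing against the standard normal distribution. The sketch below checks this on hypothetical data (the sample, seed, and variable names are assumptions made for the illustration). Note that when the mean and standard deviation are estimated from the same sample, the reported p-value tends to be larger than it should be, so it is best read as a rough guide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)                  # fixed seed for reproducibility only
sample = rng.normal(loc=50, scale=5, size=500)  # hypothetical data
m, s = sample.mean(), sample.std()

# Same test, written two ways
res_args = stats.kstest(sample, 'norm', args=(m, s))
res_std = stats.kstest((sample - m) / s, 'norm')
print(res_args.statistic, res_std.statistic)
```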

Binomial Distribution - Discrete Random Variables

Binomial Distribution - pmf

n: number of trials
x: number of successes
p: Probability of a ‘Success’ on a single trial
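With these symbols, the binomial pmf is P(X = x) = C(n, x) · p^x · (1 − p)^(n − x), where C(n, x) counts the ways to choose x successes out of n trials. A minimal sketch of that formula using only the standard library (binomial_pmf is a name introduced for this illustration):

```python
import math


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for n independent trials with success probability p (hypothetical helper)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


p_three = binomial_pmf(3, 5, 0.5)                        # 3 successes in 5 fair trials
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))   # pmf over x = 0..5 sums to 1
print(p_three, total)
```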

Binomial Distribution - pmf

With scipy package’s stats:

binom.pmf(x, n, p)

x = np.arange(0,6)
binomial_pmf = stats.binom.pmf(x, 5, 0.5)
df = {"X":x,"P":binomial_pmf}
sns.barplot(data=df,x="X",y="P")
plt.show()

Binomial Distribution - pmf

Probability (sum of the pmf)

With scipy package’s stats:

binom.pmf(x, n, p)

stats.binom.pmf(3, 5, 0.5)

0.31249999999999983

Probability (sum of the pmf)

stats.binom.pmf(0, 12, 0.2)+\
stats.binom.pmf(1, 12, 0.2)+\
stats.binom.pmf(2, 12, 0.2)

0.5583457484800001

stats.binom.pmf([0,1,2], 12, 0.2)

array([0.06871948, 0.20615843, 0.28346784])

sum(stats.binom.pmf([0,1,2], 12, 0.2))

0.5583457484800001
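Summing the pmf from 0 up to x is what the cumulative distribution function does, so binom.cdf should give the same probability in a single call. A short comparison:

```python
from scipy import stats

# P(X <= 2) for n = 12, p = 0.2, computed two ways
p_sum = stats.binom.pmf([0, 1, 2], 12, 0.2).sum()
p_cdf = stats.binom.cdf(2, 12, 0.2)
print(p_sum, p_cdf)
```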

Quiz 3-6-a

The study found that when presented with prints from the same individual, a fingerprint expert will correctly identify the match 92% of the time.

What is the probability that an expert will correctly identify the match in all five pairs of fingerprints?

stats.binom.pmf(5, 5, 0.92)

0.6590815232000001

Quiz 3-6-b

In contrast, a novice will correctly identify the match 75% of the time. Consider a sample of five different pairs of fingerprints, where each pair is a match.

What is the probability that a novice will correctly identify the match in all five pairs of fingerprints?

stats.binom.pmf(5, 5, 0.75)

0.2373046875
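When x equals n, the binomial pmf reduces to p^n, because every one of the n independent trials must be a success. Both quiz answers can be checked this way:

```python
# All five pairs identified correctly: P(X = 5) = p ** 5
p_expert = 0.92 ** 5
p_novice = 0.75 ** 5
print(p_expert, p_novice)
```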

Poisson Distribution - Discrete Random Variables

Poisson Distribution - pmf

With scipy package’s stats:

poisson.pmf(k=x, mu=λ)

x = np.arange(0, 20)
p = stats.poisson.pmf(k=x, mu=3.6)
sns.barplot(x=x,y=p)
plt.show()

Poisson Distribution - pmf

Probability (sum of the pmf)

With scipy package’s stats:

poisson.pmf(k=x, mu=λ)

stats.poisson.pmf(4, 3.6)

0.19122233917513215
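The Poisson pmf is P(X = k) = e^(−λ) · λ^k / k!. A minimal standard-library sketch of the formula, which should agree with the scipy result above (poisson_pmf is a name introduced for this illustration):

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with mean lam (hypothetical helper)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


p_four = poisson_pmf(4, 3.6)  # same inputs as stats.poisson.pmf(4, 3.6)
print(p_four)
```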

Quiz 3-8

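You work in Quality Assurance for an investment firm. A clerk enters 75 words per minute with 6 errors per hour.

What is the probability of 0 errors when the expected number of errors is λ = 0.34? At 6 ÷ 60 = 0.1 errors per minute, λ = 0.34 corresponds to 3.4 minutes of typing, that is, 255 words at 75 words per minute.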

stats.poisson.pmf(0, 0.34)

0.7117703227626097

Exponential Distribution - Continuous Random Variables

Probability (area under the pdf)

With scipy package’s stats:

expon.cdf(x, scale=θ)

1-stats.expon.cdf(x=5,scale=2)

0.08208499862389884
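For an exponential distribution with mean θ, the cdf is F(x) = 1 − e^(−x/θ), so the right-tail probability is simply P(X > x) = e^(−x/θ). The sketch below evaluates both with the standard library; the helper name expon_cdf is an assumption for this illustration.

```python
import math


def expon_cdf(x: float, theta: float) -> float:
    """P(X <= x) for an exponential distribution with mean theta (hypothetical helper)."""
    return 1.0 - math.exp(-x / theta)


p_tail = 1.0 - expon_cdf(5, 2)      # P(X > 5) with theta = 2
p_tail_direct = math.exp(-5 / 2)    # same tail area, written directly
print(p_tail, p_tail_direct)
```

In scipy the same tail area is also available as stats.expon.sf(x, scale=θ).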

Quiz 3-9

The length of life of a magnetron tube has an exponential probability distribution with θ = 6.25.

Suppose a warranty period of 5 years is attached to the magnetron tube. What fraction of tubes must the manufacturer plan to replace?

stats.expon.cdf(x=5,scale=6.25)

0.5506710358827784

Summary

Normal Distribution
- NormalDist(mu=μ, sigma=σ).cdf(x)
- NormalDist(mu=μ, sigma=σ).inv_cdf(p)
- Probability Plot
- Statistical Tests
Binomial Distribution
- binom.pmf(x, n, p)
Poisson Distribution
- poisson.pmf(k=x, mu=λ)
Exponential Distribution
- expon.cdf(x, scale=θ)