Random Variable and Distribution

Yi-Ju Tseng @NYCU

Import Packages

Import packages/libraries

We will use the following Python packages:

  • statistics
  • pandas
  • scipy
  • seaborn
  • numpy
  • matplotlib
import statistics as st
import pandas as pd
import seaborn as sns
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

Normal Distribution - Continuous Random Variables

Probability (area under the pdf)

statistics (as st) package’s function:

NormalDist(mu=μ, sigma=σ).cdf(x)

st.NormalDist(mu=0, sigma=1).cdf(1.4)
0.9192433407662289
st.NormalDist().cdf(1.4)
0.9192433407662289

Probability (area under the pdf)

st.NormalDist(mu=0.2508, sigma=0.0005).cdf(0.2485)
2.1124547024964357e-06
st.NormalDist(mu=0.2508, sigma=0.0005).cdf(0.2515)
0.9192433407662224
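
The two cdf values above can be combined into an interval probability, since P(a < X < b) = cdf(b) - cdf(a). A minimal sketch, using the same μ and σ as above; the names dist, lower and upper are illustrative and not from the slides:

```python
import statistics as st

# Same parameters as the example above
dist = st.NormalDist(mu=0.2508, sigma=0.0005)
lower, upper = 0.2485, 0.2515

# P(lower < X < upper) is the area under the pdf between the two bounds
p_between = dist.cdf(upper) - dist.cdf(lower)
print(p_between)
```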

Quiz 3-3

You work in Quality Control for GE.

Light bulb life has a normal distribution with μ = 2000 hours and σ = 200 hours.

What’s the probability that a bulb will last

A. between 2000 and 2400 hours?

st.NormalDist(mu=2000, sigma=200).cdf(2400)\
  -st.NormalDist(mu=2000, sigma=200).cdf(2000)
0.4772498680518208

Quiz 3-3

You work in Quality Control for GE.

Light bulb life has a normal distribution with μ = 2000 hours and σ = 200 hours.

What’s the probability that a bulb will last

B. less than 1470 hours?

st.NormalDist(mu=2000, sigma=200).cdf(1470)
0.004024588542758334
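
Both answers can also be obtained by standardizing first, z = (x - μ) / σ, and then using the standard normal distribution. A small sketch of that equivalent route; the helper name z_score is an assumption for illustration:

```python
import statistics as st

def z_score(x, mu, sigma):
    """Standardize x for a normal distribution with mean mu and sd sigma."""
    return (x - mu) / sigma

std_normal = st.NormalDist()  # mu=0, sigma=1

# A. P(2000 < X < 2400) = P(0 < Z < 2)
p_a = std_normal.cdf(z_score(2400, 2000, 200)) - std_normal.cdf(z_score(2000, 2000, 200))

# B. P(X < 1470) = P(Z < -2.65)
p_b = std_normal.cdf(z_score(1470, 2000, 200))

print(p_a, p_b)
```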

Finding z-Values for Known Probabilities

statistics (as st) package’s function:

NormalDist(mu=μ, sigma=σ).inv_cdf(p), where p is the cumulative probability P(X ≤ x)

st.NormalDist().inv_cdf(0.1217+0.5)
0.30994865777600955
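
inv_cdf is the inverse of cdf, so feeding its result back into cdf should recover the original probability. A quick check of that round trip (a sketch; the variable names are illustrative):

```python
import statistics as st

p = 0.1217 + 0.5                 # cumulative probability to the left of z
z = st.NormalDist().inv_cdf(p)   # z-value with that cumulative probability
p_back = st.NormalDist().cdf(z)  # should be close to p again

print(z, p_back)
```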

Quiz 3-4

For a particular generation of the tomato plant, the amount x of miraculin produced had a mean of 105.3 and a standard deviation of 8.0. Assume that x is normally distributed.

  1. Find P(100 < x < 110)
st.NormalDist(mu=105.3, sigma=8).cdf(110)-\
st.NormalDist(mu=105.3, sigma=8).cdf(100)
0.4677406073541885

Quiz 3-4

For a particular generation of the tomato plant, the amount x of miraculin produced had a mean of 105.3 and a standard deviation of 8.0. Assume that x is normally distributed.

  2. Find the value a for which P(x < a) = 0.25
st.NormalDist(mu=105.3, sigma=8).inv_cdf(0.25)
99.90408199843134
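
The same value follows from the standard normal quantile via a = μ + z·σ. A minimal sketch of that relationship, assuming the same μ and σ as the quiz:

```python
import statistics as st

mu, sigma = 105.3, 8.0

z = st.NormalDist().inv_cdf(0.25)  # 25th percentile of the standard normal (negative)
a = mu + z * sigma                 # transform back to the original scale

# Sanity check: the cdf at a should be close to 0.25
p_check = st.NormalDist(mu=mu, sigma=sigma).cdf(a)
print(a, p_check)
```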

Is Normal Distribution a Reasonable Model?

Normal Probability Plots - load data

Load the data with pandas: pd.read_csv("file path + name") reads a CSV file into a DataFrame.

data = pd.read_csv("baseball_data.csv")
data
                name handedness  height  weight   bavg   HR
0      Jose Cardenal      Right      70     150  0.275  138
1      Darrell Evans       Left      74     200  0.248  414
2      Buck Martinez      Right      70     190  0.225   58
3    John Wockenfuss      Right      72     190  0.262   86
4       Tommy McCraw       Left      72     183  0.246   75
...              ...        ...     ...     ...    ...  ...
300       Bob Watson      Right      72     201  0.295  184
301    Ken Harrelson      Right      74     190  0.239  131
302       Ed Charles      Right      70     170  0.263   86
303  Tony Conigliaro      Right      75     185  0.264  166
304      Phil Garner      Right      70     175  0.260  109

305 rows × 6 columns

Normal Probability Plots - Histogram

seaborn package’s function:

histplot(data=your data frame, x=column to plot)

sns.histplot(data=data,x='weight')
<Axes: xlabel='weight', ylabel='Count'>

Normal Probability Plots - PP

With scipy’s stats module, probplot(data, plot=plt) draws a normal probability plot: the ordered sample values are plotted against the quantiles of a normal distribution, so points close to a straight line suggest approximately normal data.

stats.probplot(data['weight'], plot=plt)
plt.show()

Normal Probability Plots - Histogram

numpy’s random.normal function generates samples from a normal distribution (mean 0 and standard deviation 1 by default).

standard_norm = np.random.normal(size=3000)
sns.histplot(standard_norm)
<Axes: ylabel='Count'>
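
Because np.random.normal draws new values on every run, the histogram changes each time. For reproducible results, one option is a seeded Generator; the seed value 42 and the names rng and sample are arbitrary choices, not part of the original example:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator, so results repeat across runs

# loc is the mean and scale is the standard deviation (0 and 1 = standard normal)
sample = rng.normal(loc=0.0, scale=1.0, size=3000)

print(sample.mean(), sample.std())
```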

Normal Probability Plots - PP

As before, probplot(data, plot=plt) from scipy’s stats module draws the probability plot; for data generated from a normal distribution, the points should lie close to the straight line.

stats.probplot(standard_norm, plot=plt)
plt.show()

Normal Probability Plots - Histogram

With scipy’s stats module, skewnorm.rvs() generates samples from a skew-normal distribution; a positive shape parameter a gives right-skewed data.

skewed_norm = stats.skewnorm.rvs(a=10, size=3000)
sns.histplot(skewed_norm)
<Axes: ylabel='Count'>

Normal Probability Plots - PP

stats.probplot(skewed_norm, plot=plt)
plt.show()
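
The straightness of a probability plot can also be summarized numerically. Called without the plot argument, stats.probplot returns the plotted points together with a least-squares fit whose correlation coefficient r is close to 1 for approximately normal data. A sketch with freshly generated, seeded samples (the seed and variable names are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=3000)
skewed_sample = stats.skewnorm.rvs(a=10, size=3000, random_state=rng)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
_, (_, _, r_normal) = stats.probplot(normal_sample)
_, (_, _, r_skewed) = stats.probplot(skewed_sample)

print(r_normal, r_skewed)  # r_normal should be closer to 1 than r_skewed
```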

Statistical Test for Normal Distribution

With scipy’s stats module, the following tests can be used. In both, the null hypothesis is that the data come from a normal distribution.

  • Shapiro-Wilk Test
    • The p-value may not be accurate for N > 5000
    • If the p-value > .05, we fail to reject normality, so the normal model is treated as reasonable.
  • Kolmogorov-Smirnov Test
    • If the p-value > .05, we fail to reject normality, so the normal model is treated as reasonable.

Statistical Test for Normal Distribution

With scipy package’s stats, the following functions can be used:

  • Shapiro-Wilk Test shapiro(data)
  • Kolmogorov-Smirnov Test kstest(data, distribution, args=(mean,sd))
stats.shapiro(standard_norm)
ShapiroResult(statistic=0.9992418885231018, pvalue=0.25002941489219666)
stats.shapiro(skewed_norm)
ShapiroResult(statistic=0.941602349281311, pvalue=7.104239699148655e-33)

Statistical Test for Normal Distribution

With scipy package’s stats, the following functions can be used:

  • Shapiro-Wilk Test shapiro(data)
  • Kolmogorov-Smirnov Test kstest(data, distribution, args=(mean,sd))
stats.kstest(standard_norm, 'norm',\
args=(standard_norm.mean(),standard_norm.std()))
KstestResult(statistic=0.014450757533408576, pvalue=0.5530800639002007, statistic_location=-0.3566578832989416, statistic_sign=1)
stats.kstest(skewed_norm, 'norm',\
args=(skewed_norm.mean(),skewed_norm.std()))
KstestResult(statistic=0.07960050707173066, pvalue=5.559960961703108e-17, statistic_location=0.5990846245423823, statistic_sign=1)
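
The decision rule from the earlier slide can be wrapped in a small helper. This is only a sketch: the function name looks_normal and the 0.05 threshold are illustrative choices, and the sample is regenerated here with a fixed seed. Note also that when the mean and standard deviation passed to kstest are estimated from the same data, the reported p-value is only approximate.

```python
import numpy as np
from scipy import stats

def looks_normal(sample, alpha=0.05):
    """Return True if neither test rejects normality at level alpha."""
    shapiro_p = stats.shapiro(sample).pvalue
    ks_p = stats.kstest(sample, 'norm', args=(sample.mean(), sample.std())).pvalue
    return bool(shapiro_p > alpha and ks_p > alpha)

rng = np.random.default_rng(1)
skewed_sample = stats.skewnorm.rvs(a=10, size=3000, random_state=rng)

print(looks_normal(skewed_sample))  # strongly skewed data should be rejected
```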

Binomial Distribution - Discrete Random Variables

Binomial Distribution - pmf

  • n: number of trials
  • x: number of successes
  • p: Probability of a ‘Success’ on a single trial
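
With these symbols, the binomial pmf is P(X = x) = C(n, x) · p^x · (1 - p)^(n - x), where C(n, x) counts the ways to choose x successes among n trials. A minimal sketch of the formula using only the standard library; the function name binom_pmf_manual is an assumption for illustration:

```python
from math import comb

def binom_pmf_manual(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# 3 successes in 5 fair trials: C(5, 3) / 2**5 = 10 / 32
print(binom_pmf_manual(3, 5, 0.5))  # 0.3125
```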

Binomial Distribution - pmf

With scipy package’s stats:

binom.pmf(x, n, p)

x = np.arange(0,6)
binomial_pmf = stats.binom.pmf(x, 5, 0.5)
df = {"X":x,"P":binomial_pmf}
sns.barplot(data=df,x="X",y="P")
plt.show()

Binomial Distribution - pmf

Probability (sum of the pmf)

With scipy package’s stats:

binom.pmf(x, n, p)

stats.binom.pmf(3, 5, 0.5)
0.31249999999999983

Probability (sum of the pmf)

stats.binom.pmf(0, 12, 0.2)+\
stats.binom.pmf(1, 12, 0.2)+\
stats.binom.pmf(2, 12, 0.2)
0.5583457484800001
stats.binom.pmf([0,1,2], 12, 0.2)
array([0.06871948, 0.20615843, 0.28346784])
sum(stats.binom.pmf([0,1,2], 12, 0.2))
0.5583457484800001
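
Summing the pmf from 0 up to x is exactly what the cumulative distribution function does, so binom.cdf gives the same probability in one call. A short sketch of the equivalence:

```python
from scipy import stats

# P(X <= 2) for n=12, p=0.2, computed two ways
p_sum = sum(stats.binom.pmf([0, 1, 2], 12, 0.2))
p_cdf = stats.binom.cdf(2, 12, 0.2)

print(p_sum, p_cdf)
```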

Quiz 3-6-a

The study found that when presented with prints from the same individual, a fingerprint expert will correctly identify the match 92% of the time.

  1. What is the probability that an expert will correctly identify the match in all five pairs of fingerprints?
stats.binom.pmf(5, 5, 0.92)
0.6590815232000001

Quiz 3-6-b

In contrast, a novice will correctly identify the match 75% of the time. Consider a sample of five different pairs of fingerprints, where each pair is a match.

  1. What is the probability that a novice will correctly identify the match in all five pairs of fingerprints?
stats.binom.pmf(5, 5, 0.75)
0.2373046875
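
When x = n, every trial must succeed, so the binomial pmf collapses to p^n. A quick standard-library check of both quiz answers under that observation:

```python
n = 5

p_expert = 0.92 ** n   # all five matches identified by an expert
p_novice = 0.75 ** n   # all five matches identified by a novice

print(p_expert, p_novice)
```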

Poisson Distribution - Discrete Random Variables

Poisson Distribution - pmf

With scipy package’s stats:

poisson.pmf(k=x, mu=λ)

x = np.arange(0, 20)
p = stats.poisson.pmf(k=x, mu=3.6)
sns.barplot(x=x,y=p)
plt.show()

Poisson Distribution - pmf

Probability (sum of the pmf)

With scipy package’s stats:

poisson.pmf(k=x, mu=λ)

stats.poisson.pmf(4, 3.6)
0.19122233917513215
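
The Poisson pmf is P(X = k) = λ^k · e^(-λ) / k!. A minimal standard-library sketch that reproduces the value above; the function name poisson_pmf_manual is an assumption for illustration:

```python
from math import exp, factorial

def poisson_pmf_manual(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return lam**k * exp(-lam) / factorial(k)

print(poisson_pmf_manual(4, 3.6))  # approximately 0.1912
```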

Quiz 3-8

You work in Quality Assurance for an investment firm. A clerk enters 75 words per minute with 6 errors per hour.

stats.poisson.pmf(0, 0.34)
0.7117703227626097
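
The slide does not show how λ = 0.34 was obtained. One reading that is consistent with the numbers: 6 errors per hour is 0.1 errors per minute, so λ = 0.34 corresponds to 3.4 minutes of typing, which at 75 words per minute is 255 words. Under that assumption, the probability of zero errors is e^(-λ):

```python
from math import exp

errors_per_minute = 6 / 60         # 6 errors per hour
words_per_minute = 75

lam = 0.34                         # expected number of errors, as used on the slide
minutes = lam / errors_per_minute  # typing time implied by lam (an assumption)
words = minutes * words_per_minute # number of words typed in that time

p_zero_errors = exp(-lam)          # Poisson P(X = 0) = e^(-lam)
print(minutes, words, p_zero_errors)
```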

Exponential Distribution - Continuous Random Variables

Probability (area under the pdf)

With scipy package’s stats:

expon.cdf(x, scale=θ)

1-stats.expon.cdf(x=5,scale=2)
0.08208499862389884
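
For an exponential distribution with mean θ, the cdf is F(x) = 1 - e^(-x/θ), so the right-tail probability is P(X > x) = e^(-x/θ). A standard-library sketch of both; the function names are assumptions for illustration:

```python
from math import exp

def expon_cdf_manual(x, theta):
    """P(X <= x) for an exponential distribution with mean theta."""
    return 1 - exp(-x / theta)

def expon_tail_manual(x, theta):
    """P(X > x), the complement of the cdf."""
    return exp(-x / theta)

print(expon_tail_manual(5, 2))  # approximately 0.0821
```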

Quiz 3-9

The length of life of a magnetron tube has an exponential probability distribution with θ = 6.25.

Suppose a warranty period of 5 years is attached to the magnetron tube. What fraction of tubes must the manufacturer plan to replace?

stats.expon.cdf(x=5,scale=6.25)
0.5506710358827784

Summary

  • Normal Distribution
    • NormalDist(mu=μ, sigma=σ).cdf(x)
    • Probability Plot
    • Statistical Test
  • Binomial Distribution binom.pmf(x, n, p)
  • Poisson Distribution poisson.pmf(k=x, mu=λ)
  • Exponential Distribution expon.cdf(x, scale=θ)