Descriptive Analysis with Python

Yi-Ju Tseng @NYCU

Basic Setup

Python installation

Python can be installed in several ways. If you are not familiar with Python, feel free to follow these steps:

Download Python from https://www.python.org/downloads/
Install it by double-clicking the installer
Check that the installation succeeded by typing python3 in CMD (Windows) or a terminal (Mac or Linux); on Windows the command may be python or py instead

Packages/libraries installation

In CMD (Windows) or terminal (Mac or Linux), type

pip3 install pandas jupyter seaborn

to install the packages.

You can also use conda or another method to install the packages.
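To confirm that the packages are visible to Python after installation, a short check like the one below can be run. This is only a minimal sketch; the helper name check_packages is made up for illustration and is not part of any library.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each package name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# The three packages installed with pip3 in the step above
status = check_packages(["pandas", "jupyter", "seaborn"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If a package is reported as missing, it was probably installed into a different Python environment than the one running this code.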

Setup an IDE for data analysis

If you are not familiar with Python IDEs, I recommend using VS Code with Jupyter Notebook.
Jupyter Notebook is a very good Python interface for data analysis (including statistics).
You can install and use Jupyter Notebook in different ways.
- We installed the jupyter library in the previous step
Check the documentation for more instructions

1. Download and install VS Code

VS Code

2. Check Python and the jupyter package

Make sure you have installed Python and the jupyter package (see previous slide)

@ Terminal or CMD or Python

3. Install Python extension

In VS Code, click Extensions, search for Python, and install the extension

The Jupyter extension is also useful

4. Trust your workspace

Click Trust and add the folder containing your code to the trusted workspaces

5. Create a Jupyter Notebook

by running the Create: New Jupyter Notebook command from the Command Palette (⇧⌘P on Mac, Ctrl+Shift+P on Windows or Linux)
by creating a new .ipynb file in your workspace

6. Now we are ready to start!

Import Packages and Data

Import packages/libraries

We will use the following packages in python for descriptive analysis.

statistics
pandas

import statistics as st
import pandas as pd

Data import

Import the data with pandas using pd.read_csv("file path + name"), then analyze it in Python.

data = pd.read_csv("baseball_data.csv")
data

	name	handedness	height	weight	bavg	HR
0	Jose Cardenal	Right	70	150	0.275	138
1	Darrell Evans	Left	74	200	0.248	414
2	Buck Martinez	Right	70	190	0.225	58
3	John Wockenfuss	Right	72	190	0.262	86
4	Tommy McCraw	Left	72	183	0.246	75
...	...	...	...	...	...	...
300	Bob Watson	Right	72	201	0.295	184
301	Ken Harrelson	Right	74	190	0.239	131
302	Ed Charles	Right	70	170	0.263	86
303	Tony Conigliaro	Right	75	185	0.264	166
304	Phil Garner	Right	70	175	0.260	109

305 rows × 6 columns
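After importing, it is worth checking the size and column types before computing anything. The sketch below assumes a tiny in-memory CSV built from the first three rows shown above, standing in for baseball_data.csv so that it runs without the file.

```python
import io

import pandas as pd

# First three rows of the table above, used as a stand-in for baseball_data.csv
csv_text = """name,handedness,height,weight,bavg,HR
Jose Cardenal,Right,70,150,0.275,138
Darrell Evans,Left,74,200,0.248,414
Buck Martinez,Right,70,190,0.225,58
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # data type of each column
print(sample.head(2))  # first two rows
```

With the real file, the same three calls apply to data directly.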

Central Tendency

Mean
Median

Mean

We can use:

statistics (as st) package’s function
- mean(pandas series)
pandas series’s function
- mean()
- get pandas series using [column name]

st.mean(data['height'])
data['height'].mean()

72.80655737704917
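As a quick check of what both functions compute, the mean is the sum of the values divided by their count. The sketch below uses only the first five heights from the table above, so its result differs from the full-data mean.

```python
import statistics as st

import pandas as pd

values = [70, 74, 70, 72, 72]  # first five heights from the table above
heights = pd.Series(values)

manual_mean = sum(values) / len(values)  # sum of values / number of values
stats_mean = st.mean(values)             # statistics package
pandas_mean = heights.mean()             # pandas Series method

print(manual_mean)  # 71.6
print(stats_mean)   # 71.6
print(pandas_mean)
```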

Mean - more data

The pandas mean() method also works on several columns at once
- select multiple columns using [[column 1, column 2]]

data[['height','weight']].mean()

height     72.806557
weight    187.449180
dtype: float64

Median

We can use:

statistics (as st) package’s function
- median(pandas series)
pandas series’s function
- median()

st.median(data['height'])
data['height'].median()

73.0
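The median is less affected by extreme values than the mean, which matters for skewed columns such as HR. A minimal illustration, using the HR values of the first five players in the table above:

```python
import statistics as st

home_runs = [138, 414, 58, 86, 75]  # HR of the first five players above

hr_mean = st.mean(home_runs)      # pulled upward by the single large value 414
hr_median = st.median(home_runs)  # middle value after sorting

print(hr_mean)    # 154.2
print(hr_median)  # 86

# With an even number of values, the median is the average of the two middle ones
print(st.median([58, 75, 86, 138]))  # 80.5
```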

Variability

Variance

For sample variance (dividing by n - 1), we can use:

statistics (as st) package’s function
- variance(pandas series)
pandas series’s function
- var()

st.variance(data['height'])
data['height'].var()

3.2223252804141502
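Both calls above return the sample variance, which divides the sum of squared deviations by n - 1. The sketch below spells out that formula on the first five heights and compares it with the two library calls.

```python
import statistics as st

import pandas as pd

values = [70, 74, 70, 72, 72]  # first five heights from the table above
n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the mean
squared_dev = sum((x - mean) ** 2 for x in values)

manual_var = squared_dev / (n - 1)   # sample variance: divide by n - 1
stats_var = st.variance(values)      # statistics package
pandas_var = pd.Series(values).var() # pandas (ddof=1 by default)

print(manual_var, stats_var, pandas_var)  # all approximately 2.8
```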

Standard deviation

For sample standard deviation, we can use:

statistics (as st) package’s function
- stdev(pandas series)
pandas series’s function
- std()

st.stdev(data['height'])
data['height'].std()

1.7950836416206768

Standard deviation - population

For population standard deviation, we can use:

statistics (as st) package’s function
- pstdev(pandas series)

st.pstdev(data['height'])

1.7921384654916481
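pandas can also return the population standard deviation: std() takes a ddof argument (delta degrees of freedom), and ddof=0 divides by n instead of n - 1. A minimal sketch on the first five heights:

```python
import statistics as st

import pandas as pd

values = [70, 74, 70, 72, 72]  # first five heights from the table above
heights = pd.Series(values)

pop_sd_stats = st.pstdev(values)       # population SD from statistics
pop_sd_pandas = heights.std(ddof=0)    # population SD from pandas
sample_sd = heights.std()              # default ddof=1, i.e. sample SD

print(pop_sd_stats, pop_sd_pandas)  # the two population values agree
print(sample_sd)                    # slightly larger than the population SD
```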

Quantiles - 1

For quantiles, we can use:

statistics (as st) package’s function
- quantiles(pandas series, n=the number of partitions)

st.quantiles(data['height'],n=4)

[72.0, 73.0, 74.0]

Quantiles - 2

pandas series’s function
- quantile(percentiles you want to get)

data['height'].quantile([0.25,0.5,0.75])

0.25    72.0
0.50    73.0
0.75    74.0
Name: height, dtype: float64
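The two approaches agree on the height column, but they do not use the same interpolation rule by default: st.quantiles() defaults to method="exclusive", while pandas interpolates linearly between data points, which corresponds to method="inclusive". The sketch below uses made-up values 1 to 10 to show the difference, and also computes the interquartile range (IQR) as Q3 - Q1.

```python
import statistics as st

import pandas as pd

values = list(range(1, 11))  # made-up data: 1, 2, ..., 10
s = pd.Series(values)

exclusive = st.quantiles(values, n=4)                      # default method
inclusive = st.quantiles(values, n=4, method="inclusive")
pandas_q = s.quantile([0.25, 0.5, 0.75]).tolist()

print(exclusive)  # [2.75, 5.5, 8.25]
print(inclusive)  # [3.25, 5.5, 7.75]
print(pandas_q)   # matches the inclusive result

# Interquartile range: third quartile minus first quartile
iqr = s.quantile(0.75) - s.quantile(0.25)
print(iqr)
```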

Summary for a single Column

It is tedious to compute all the statistics one by one
A pandas Series provides a wrap-up function
- describe()

data['height'].describe()

count    305.000000
mean      72.806557
std        1.795084
min       67.000000
25%       72.000000
50%       73.000000
75%       74.000000
max       78.000000
Name: height, dtype: float64

Summary for a pandas table

You can get all the descriptive statistics for a pandas table (DataFrame) at once

describe()

data.describe()

	height	weight	bavg	HR
count	305.000000	305.000000	305.00000	305.000000
mean	72.806557	187.449180	0.26142	139.426230
std	1.795084	15.439766	0.01889	91.206363
min	67.000000	150.000000	0.21200	50.000000
25%	72.000000	175.000000	0.24800	76.000000
50%	73.000000	190.000000	0.26000	109.000000
75%	74.000000	195.000000	0.27400	173.000000
max	78.000000	230.000000	0.32800	563.000000
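By default describe() summarizes only the numeric columns, which is why name and handedness are missing from the table above. Passing include="all" adds the text columns, reporting their count, number of unique values, most frequent value, and its frequency. A minimal sketch on a five-row sample taken from the table at the start:

```python
import pandas as pd

# Handedness and height of the first five players above
sample = pd.DataFrame({
    "handedness": ["Right", "Left", "Right", "Right", "Left"],
    "height": [70, 74, 70, 72, 72],
})

summary = sample.describe(include="all")  # numeric and text columns together
print(summary)

# Frequency table for a single categorical column
counts = sample["handedness"].value_counts()
print(counts)
```

Statistics that do not apply to a column (for example the mean of handedness) appear as NaN.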

Correlation

A pandas Series provides the function corr() to calculate the correlation coefficient between two series (Pearson correlation by default).

data['height'].corr(data['weight'])

0.596814815286583
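For reference, the Pearson coefficient is the covariance of the two series divided by the product of their standard deviations. The sketch below computes it by hand on the first five players and compares it with corr(); the helper name pearson is made up for illustration.

```python
import math

import pandas as pd

# Height and weight of the first five players above
sample = pd.DataFrame({
    "height": [70, 74, 70, 72, 72],
    "weight": [150, 200, 190, 190, 183],
})


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


r_manual = pearson(sample["height"].tolist(), sample["weight"].tolist())
r_pandas = sample["height"].corr(sample["weight"])
print(r_manual, r_pandas)  # the two values agree

# Correlation matrix for every pair of numeric columns
corr_matrix = sample.corr()
print(corr_matrix)
```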

Visualization

Import packages/libraries

We will use the following packages in python for visualization.

seaborn
matplotlib (installed together with seaborn, which uses it to draw the plots)

import seaborn as sns

Histogram

histplot(data=your data frame,x=x axis) from seaborn

sns.histplot(data=data, x="height")

<Axes: xlabel='height', ylabel='Count'>

Document

Histogram - bin width

sns.histplot(data=data, x="height", binwidth=2)

<Axes: xlabel='height', ylabel='Count'>

Bar chart

barplot(data=your data, x=x axis, y=y axis) from seaborn

sns.barplot(data=data, x="handedness",y="weight")

<Axes: xlabel='handedness', ylabel='weight'>
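By default, barplot() draws the mean of y for each category of x, with an error bar around it. The bar heights can be reproduced with a pandas groupby, sketched here on the first five players from the table:

```python
import pandas as pd

# Handedness and weight of the first five players above
sample = pd.DataFrame({
    "handedness": ["Right", "Left", "Right", "Right", "Left"],
    "weight": [150, 200, 190, 190, 183],
})

# Mean weight per handedness group: the value each bar represents
group_means = sample.groupby("handedness")["weight"].mean()
print(group_means)
```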

Document

Box plot

boxplot(data=your data, x=x axis, y=y axis) from seaborn

sns.boxplot(data=data, x="height",y="handedness")

<Axes: xlabel='height', ylabel='handedness'>

Document

Time series plot

lineplot(data=your data frame,x=time column,y=data column) from seaborn

flights = pd.read_csv("flights.csv")
flights.head()

	year	month	passengers
0	1949	January	112
1	1949	February	118
2	1949	March	132
3	1949	April	129
4	1949	May	121

Document

Example

sns.lineplot(data=flights, x="year", y="passengers")

<Axes: xlabel='year', ylabel='passengers'>
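Because each year has several monthly rows, lineplot() with x="year" draws the yearly mean (with a confidence band by default) rather than each monthly value. To plot every month, year and month can be combined into one date column first. The sketch below assumes the first three rows of flights shown above; the column name date is made up for illustration.

```python
import pandas as pd

# First three rows of the flights table above
flights = pd.DataFrame({
    "year": [1949, 1949, 1949],
    "month": ["January", "February", "March"],
    "passengers": [112, 118, 132],
})

# Combine year and month name into a single datetime (first day of each month)
flights["date"] = pd.to_datetime(
    flights["year"].astype(str) + "-" + flights["month"], format="%Y-%B"
)
print(flights[["date", "passengers"]])

# Yearly mean, which is what the line traces when x="year"
yearly_mean = flights.groupby("year")["passengers"].mean()
print(yearly_mean)
```

The monthly series can then be drawn with sns.lineplot(data=flights, x="date", y="passengers").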

Scatter plot

scatterplot(data=your data, x=x axis, y=y axis) from seaborn

sns.scatterplot(data=data, x="height", y="weight")

<Axes: xlabel='height', ylabel='weight'>

Document

Summary

Python 101
Mean, Median
Variance, SD, IQR
Correlation
Visualization
- histogram, bar chart, box plot, time series plot (line chart), scatter plot
Summary statistics with describe()