Descriptive Analysis with Python

Yi-Ju Tseng @NYCU

Basic Setup

Python installation

You can install Python in several ways. If you are not familiar with Python, feel free to follow these steps:

  • Download Python from https://www.python.org/downloads/
  • Install it by double-clicking the installer
  • Check that the installation succeeded by typing python3 in CMD (Windows) or Terminal (Mac or Linux); on Windows, the command may be python or py instead

Packages/libraries installation

In CMD (Windows) or terminal (Mac or Linux), type

pip3 install pandas jupyter seaborn

to install the packages.

You can also use conda or another method to install them.
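To confirm that the packages are visible to the Python interpreter you are using, a small check like the following may help. It is only a sketch: the helper name is_installed is made up for this example, and it assumes each package can be imported under the same name used with pip3.

```python
import importlib.util
import sys

# Show which Python interpreter version is running this check
print("Python", sys.version.split()[0])


def is_installed(package_name):
    """Return True if the top-level module can be found (hypothetical helper)."""
    return importlib.util.find_spec(package_name) is not None


# The three packages installed with pip3 in the previous step
status = {name: is_installed(name) for name in ("pandas", "jupyter", "seaborn")}
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If a package is reported as missing, the pip3 command may have installed it for a different Python installation than the one running this check.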

Setup an IDE for data analysis

  • If you are not familiar with Python IDEs, I recommend using VS Code with Jupyter Notebook.
  • Jupyter Notebook is a very good Python interface for data analysis (including statistics).
  • You can install and use Jupyter Notebook in different ways.
    • We installed the jupyter library in the previous step
  • Check the documentation for more instructions

1. Download and install VS Code

VS Code

2. Check python and jupyter package

Make sure you have installed Python and the jupyter package (see the previous slide).

You can check from Terminal, CMD, or Python.

3. Install Python extension

In VS Code, click Extensions, search for "python", and install the extension.

The Jupyter extension is also useful.

4. Trust your workspace

Click Trust and add the folder that contains your code to the trusted workspace.

5. Create a Jupyter Notebook

  • by running the Create: New Jupyter Notebook command from the Command Palette (⇧⌘P on Mac, Ctrl+Shift+P on Windows or Linux)
  • or by creating a new .ipynb file in your workspace

6. Now we are ready to start!

Import Packages and Data

Import packages/libraries

We will use the following Python packages for descriptive analysis.

  • statistics (part of the Python standard library, so no installation is needed)
  • pandas
import statistics as st
import pandas as pd

Data import

Import the data with pandas so that it can be analyzed in Python: pd.read_csv("file path + name")

data = pd.read_csv("baseball_data.csv")
data
                name  handedness  height  weight   bavg   HR
0      Jose Cardenal       Right      70     150  0.275  138
1      Darrell Evans        Left      74     200  0.248  414
2      Buck Martinez       Right      70     190  0.225   58
3    John Wockenfuss       Right      72     190  0.262   86
4       Tommy McCraw        Left      72     183  0.246   75
...              ...         ...     ...     ...    ...  ...
300       Bob Watson       Right      72     201  0.295  184
301    Ken Harrelson       Right      74     190  0.239  131
302       Ed Charles       Right      70     170  0.263   86
303  Tony Conigliaro       Right      75     185  0.264  166
304      Phil Garner       Right      70     175  0.260  109

305 rows × 6 columns
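If you do not have baseball_data.csv at hand, the same call can be tried on CSV text held in memory. The sketch below assumes the file is comma-separated with a header row; it retypes only the first five rows shown above, and the variable names csv_text and sample are made up for this example.

```python
import io

import pandas as pd

# First five rows of the table above, retyped as CSV text (assumed file layout)
csv_text = """name,handedness,height,weight,bavg,HR
Jose Cardenal,Right,70,150,0.275,138
Darrell Evans,Left,74,200,0.248,414
Buck Martinez,Right,70,190,0.225,58
John Wockenfuss,Right,72,190,0.262,86
Tommy McCraw,Left,72,183,0.246,75
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # the type pandas inferred for each column
print(sample.head(3))
```

Checking shape and dtypes right after import is a quick way to catch a wrong separator or a numeric column that was read as text.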

Central Tendency

  • Mean
  • Median

Mean

We can use either of the following (both return the same value):

  • the statistics (as st) package’s function
    • mean(pandas Series)
  • the pandas Series method
    • mean()
    • select a Series from the table with [column name]
st.mean(data['height'])
data['height'].mean()
72.80655737704917

Mean - more data

  • The pandas method also works on several columns at once

    • select multiple columns with [[column 1, column 2]]
data[['height','weight']].mean()
height     72.806557
weight    187.449180
dtype: float64

Median

We can use either:

  • the statistics (as st) package’s function
    • median(pandas Series)
  • the pandas Series method
    • median()
st.median(data['height'])
data['height'].median()
73.0
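To see how the two measures behave differently, here is a small sketch using only the statistics package. The first five heights come from the table shown earlier; the extra value 720 is a hypothetical data-entry error added for illustration and is not in the dataset.

```python
import statistics as st

# Heights of the first five players shown in the table
heights = [70, 74, 70, 72, 72]

mean_h = st.mean(heights)
median_h = st.median(heights)
print("mean:", mean_h, "median:", median_h)

# Hypothetical typo: 72 entered as 720 (not part of the real data)
with_outlier = heights + [720]

mean_out = st.mean(with_outlier)
median_out = st.median(with_outlier)
print("mean with outlier:", mean_out, "median with outlier:", median_out)
```

The mean is pulled strongly toward the extreme value while the median barely moves, which is why it is worth reporting both.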

Variability

Variance

For the sample variance (divided by n - 1), we can use either:

  • the statistics (as st) package’s function
    • variance(pandas Series)
  • the pandas Series method
    • var()
st.variance(data['height'])
data['height'].var()
3.2223252804141502

Standard deviation

For the sample standard deviation, we can use either:

  • the statistics (as st) package’s function
    • stdev(pandas Series)
  • the pandas Series method
    • std()
st.stdev(data['height'])
data['height'].std()
1.7950836416206768

Standard deviation - population

For the population standard deviation, we can use:

  • the statistics (as st) package’s function
    • pstdev(pandas Series)
  • the pandas Series method with ddof=0
    • std(ddof=0)
st.pstdev(data['height'])
1.7921384654916481
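The sample and population versions differ only in the divisor: n - 1 for a sample, n for a population. The sketch below works this out by hand on the first five heights from the table and compares the result with the statistics functions; the variable names are made up for this example.

```python
import math
import statistics as st

# Heights of the first five players shown in the table
heights = [70, 74, 70, 72, 72]

n = len(heights)
mean_h = sum(heights) / n

# Sum of squared deviations from the mean
sum_sq = sum((h - mean_h) ** 2 for h in heights)

sample_var = sum_sq / (n - 1)   # divisor n - 1, as in st.variance
population_var = sum_sq / n     # divisor n, as in st.pvariance

sample_sd = math.sqrt(sample_var)          # as in st.stdev
population_sd = math.sqrt(population_var)  # as in st.pstdev

print(sample_var, st.variance(heights))
print(population_var, st.pvariance(heights))
print(sample_sd, st.stdev(heights))
print(population_sd, st.pstdev(heights))
```

With 305 rows the two versions are close, as the outputs above show, but with only a few observations the gap is noticeable.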

Quantiles - 1

For quantiles, we can use:

  • the statistics (as st) package’s function
    • quantiles(pandas Series, n=the number of equal-sized groups); n=4 gives the three quartiles
st.quantiles(data['height'],n=4)
[72.0, 73.0, 74.0]

Quantiles - 2

  • the pandas Series method
    • quantile(list of quantiles you want, as fractions between 0 and 1)
data['height'].quantile([0.25,0.5,0.75])
0.25    72.0
0.50    73.0
0.75    74.0
Name: height, dtype: float64
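The interquartile range (IQR) is the third quartile minus the first quartile. The sketch below computes it with the statistics package on the first five heights from the table. One caveat: st.quantiles uses method="exclusive" by default, while method="inclusive" interpolates linearly between data points, which to my understanding is what the pandas quantile() default does. On small samples the two methods can give different quartiles.

```python
import statistics as st

# Heights of the first five players shown in the table
heights = [70, 74, 70, 72, 72]

# Default method ("exclusive") versus linear interpolation ("inclusive")
q_exclusive = st.quantiles(heights, n=4)
q_inclusive = st.quantiles(heights, n=4, method="inclusive")
print("exclusive:", q_exclusive)
print("inclusive:", q_inclusive)

# IQR = Q3 - Q1, here based on the inclusive quartiles
q1, q2, q3 = q_inclusive
iqr = q3 - q1
print("IQR:", iqr)
```

On the full 305-row height column both approaches returned the same quartiles in the outputs above, so the choice of method matters mainly for small or unusual samples.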

Summary for a single Column

  • Computing every statistic one by one is tedious
  • pandas Series provides a wrap-up method
    • describe()
data['height'].describe()
count    305.000000
mean      72.806557
std        1.795084
min       67.000000
25%       72.000000
50%       73.000000
75%       74.000000
max       78.000000
Name: height, dtype: float64

Summary for a pandas table

You can get all the descriptive statistics for every numeric column of a pandas table at once:

  • describe()
data.describe()
           height      weight       bavg          HR
count  305.000000  305.000000  305.00000  305.000000
mean    72.806557  187.449180    0.26142  139.426230
std      1.795084   15.439766    0.01889   91.206363
min     67.000000  150.000000    0.21200   50.000000
25%     72.000000  175.000000    0.24800   76.000000
50%     73.000000  190.000000    0.26000  109.000000
75%     74.000000  195.000000    0.27400  173.000000
max     78.000000  230.000000    0.32800  563.000000

Correlation

pandas Series provides a method, corr(), that calculates the correlation coefficient between two Series (Pearson correlation by default).

data['height'].corr(data['weight'])
0.596814815286583
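To make clear what the coefficient measures, here is a hand-written sketch of the Pearson formula: the sum of the products of deviations divided by the product of the root sums of squared deviations. It uses the heights and weights of the first five players in the table; pearson_r is a name made up for this example, not a pandas function.

```python
import math

# Heights and weights of the first five players shown in the table
heights = [70, 74, 70, 72, 72]
weights = [150, 200, 190, 190, 183]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists (sketch)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: how x and y deviate from their means together
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: spread of x times spread of y
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (spread_x * spread_y)


r = pearson_r(heights, weights)
print(r)
```

The result is always between -1 and 1. A value based on five rows will differ from the value computed on all 305 rows.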

Visualization

Import packages/libraries

We will use the following Python packages for visualization.

  • seaborn
  • matplotlib (seaborn is built on it, so it is installed together with seaborn)
import seaborn as sns

Histogram

  • histplot(data=your data frame, x=x axis) from seaborn
sns.histplot(data=data, x="height")
<Axes: xlabel='height', ylabel='Count'>

Histogram - bin width

sns.histplot(data=data, x="height", binwidth=2)
<Axes: xlabel='height', ylabel='Count'>
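What binwidth does can be sketched without any plotting: each value is assigned to a bin of fixed width and the values per bin are counted. The example below uses the ten heights visible in the table and starts the first bin at the smallest value, which is an assumption of this sketch; seaborn may place its bin edges differently.

```python
from collections import Counter

# Heights of the ten players visible in the table (rows 0-4 and 300-304)
heights = [70, 74, 70, 72, 72, 72, 74, 70, 75, 70]

binwidth = 2
lowest = min(heights)  # assumed left edge of the first bin


def bin_start(value):
    """Left edge of the bin that contains value (hypothetical helper)."""
    return lowest + ((value - lowest) // binwidth) * binwidth


# Count how many heights fall into each bin
counts = Counter(bin_start(h) for h in heights)
for start in sorted(counts):
    print(f"[{start}, {start + binwidth}): {counts[start]}")
```

A larger binwidth merges neighbouring bars and smooths the shape; a smaller one shows more detail but also more noise.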

Bar chart

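  • barplot(data=your data, x=x axis, y=y axis) from seaborn
    • by default, each bar shows the mean of y within the x category, with an error bar for the 95% confidence interval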
sns.barplot(data=data, x="handedness",y="weight")
<Axes: xlabel='handedness', ylabel='weight'>
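The bar heights in this kind of chart are group means. As a rough sketch of that calculation, the code below groups the weights of the ten players visible in the table by handedness and averages each group; the names rows, groups and group_means are made up for this example.

```python
import statistics as st
from collections import defaultdict

# (handedness, weight) for the ten players visible in the table
rows = [
    ("Right", 150), ("Left", 200), ("Right", 190), ("Right", 190), ("Left", 183),
    ("Right", 201), ("Right", 190), ("Right", 170), ("Right", 185), ("Right", 175),
]

# Collect the weights of each handedness group
groups = defaultdict(list)
for hand, weight in rows:
    groups[hand].append(weight)

# One mean per group: this is what each bar would represent
group_means = {hand: st.mean(ws) for hand, ws in groups.items()}
for hand, mean_weight in group_means.items():
    print(hand, len(groups[hand]), mean_weight)
```

Printing the group size next to the mean is a useful habit, because a bar based on very few observations is much less reliable than it looks.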

Box plot

  • boxplot(data=your data, x=x axis, y=y axis) from seaborn; putting the numeric column on x draws horizontal boxes
sns.boxplot(data=data, x="height",y="handedness")
<Axes: xlabel='height', ylabel='handedness'>

Time series plot

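  • lineplot(data=your data frame, x=time column, y=data column) from seaborn
    • when an x value has several rows (here, one row per month within each year), lineplot by default draws their mean with a 95% confidence band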
flights = pd.read_csv("flights.csv")
flights.head()
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121

Example

sns.lineplot(data=flights, x="year", y="passengers")
<Axes: xlabel='year', ylabel='passengers'>

Scatter plot

  • scatterplot(data=your data, x=x axis, y=y axis) from seaborn
sns.scatterplot(data=data, x="height", y="weight")
<Axes: xlabel='height', ylabel='weight'>

Summary

  • Python 101
  • Mean, Median
  • Variance, SD, quantiles (IQR)
  • Correlation
  • Visualization
    • histogram, bar chart, box plot, time series plot (line chart), scatter plot
  • Summary statistics with describe()