You can install Python in different ways. If you are not familiar with Python, feel free to follow these steps:
Type
python3
in CMD (Windows) or terminal (Mac or Linux)
In CMD (Windows) or terminal (Mac or Linux), type
pip3 install pandas jupyter seaborn
to install packages
You can also use conda or another method to install packages
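Once the install command has finished, one way to confirm that the packages are visible to Python is a small check like the sketch below. The helper name installed_version is hypothetical and not part of the slides; it relies on importlib.metadata from the standard library (Python 3.8 or later) and looks packages up by the names used in the pip3 command.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a package, or None if it is not found."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Names taken from the pip3 command in the steps above.
for dist_name in ["pandas", "jupyter", "seaborn"]:
    version = installed_version(dist_name)
    print(dist_name, version if version is not None else "not installed")
```

A package reported as "not installed" usually means pip3 installed into a different Python than the one running the script.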
The jupyter library was installed in the previous step
Make sure you have installed Python and the jupyter package (see previous slide)
@ Terminal or CMD or Python
In VS Code, click Extensions, search for Python, and install the extension
The Jupyter extension is also useful
Click Trust and add the folder containing your code to the trusted workspace
We will use the following Python packages for descriptive analysis.
statistics
pandas
Import the data with pandas and analyze it in Python: pd.read_csv("file path + name")
| | name | handedness | height | weight | bavg | HR |
---|---|---|---|---|---|---|
0 | Jose Cardenal | Right | 70 | 150 | 0.275 | 138 |
1 | Darrell Evans | Left | 74 | 200 | 0.248 | 414 |
2 | Buck Martinez | Right | 70 | 190 | 0.225 | 58 |
3 | John Wockenfuss | Right | 72 | 190 | 0.262 | 86 |
4 | Tommy McCraw | Left | 72 | 183 | 0.246 | 75 |
... | ... | ... | ... | ... | ... | ... |
300 | Bob Watson | Right | 72 | 201 | 0.295 | 184 |
301 | Ken Harrelson | Right | 74 | 190 | 0.239 | 131 |
302 | Ed Charles | Right | 70 | 170 | 0.263 | 86 |
303 | Tony Conigliaro | Right | 75 | 185 | 0.264 | 166 |
304 | Phil Garner | Right | 70 | 175 | 0.260 | 109 |
305 rows × 6 columns
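As a minimal sketch of the import step, the example below reads CSV text with pd.read_csv. The real file path is not given in the slides, so the sketch reads the first five rows of the table above from an in-memory string instead; the variable names csv_text and players are assumptions made for illustration.

```python
import io

import pandas as pd

# First five rows of the table above, written out as CSV text.
csv_text = """name,handedness,height,weight,bavg,HR
Jose Cardenal,Right,70,150,0.275,138
Darrell Evans,Left,74,200,0.248,414
Buck Martinez,Right,70,190,0.225,58
John Wockenfuss,Right,72,190,0.262,86
Tommy McCraw,Left,72,183,0.246,75
"""

# With a real file, pass its path instead of the StringIO object.
players = pd.read_csv(io.StringIO(csv_text))

print(players.head())   # first rows of the data frame
print(players.shape)    # (number of rows, number of columns)
```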
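For the mean, we can use:
The statistics (st) package's function mean(pandas series)
The pandas series's function mean(), called on a column selected with [column name]
The pandas function mean() also works on multiple columns selected with [[column 1, column 2]], returning one mean per column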
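The options for the mean can be sketched as follows. The data frame players is a hypothetical stand-in that holds only the first five rows of the table shown earlier. The column is converted with tolist() before it is handed to statistics; this is a cautious choice made for this sketch, since statistics is written around Python's built-in number types.

```python
import statistics as st

import pandas as pd

# Hypothetical stand-in: first five rows of the height and weight columns.
players = pd.DataFrame({
    "height": [70, 74, 70, 72, 72],
    "weight": [150, 200, 190, 190, 183],
})

# statistics package: mean of one column, passed as a plain list
mean_st = st.mean(players["height"].tolist())

# pandas: mean of one column
mean_pd = players["height"].mean()

# pandas: one mean per column when several columns are selected
means = players[["height", "weight"]].mean()

print(mean_st, mean_pd)
print(means)
```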
For the median, we can use:
The statistics (st) package's function median(pandas series)
The pandas series's function median()
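A short sketch of both median options, using the heights from the first five rows of the table as hypothetical sample data (the name heights is an assumption):

```python
import statistics as st

import pandas as pd

# Heights from the first five rows of the table (illustrative sample only).
heights = pd.Series([70, 74, 70, 72, 72], name="height")

median_st = st.median(heights.tolist())  # statistics package, plain list input
median_pd = heights.median()             # pandas series method

print(median_st, median_pd)
```

With an odd number of values, both approaches return the middle value of the sorted data.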
For the sample variance, we can use:
The statistics (st) package's function variance(pandas series)
The pandas series's function var()
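The sketch below compares the two ways of computing the sample variance. Both divide by n - 1; in pandas this is the default setting ddof=1 of var(). The series heights is hypothetical sample data taken from the first five rows of the table.

```python
import statistics as st

import pandas as pd

# Heights from the first five rows of the table (illustrative sample only).
heights = pd.Series([70, 74, 70, 72, 72], name="height")

var_st = st.variance(heights.tolist())  # sample variance, divides by n - 1
var_pd = heights.var()                  # pandas default ddof=1, also n - 1

print(var_st, var_pd)
```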
For sample standard deviation, we can use:
The statistics (st) package's function stdev(pandas series)
The pandas series's function std()
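As a sketch, the sample standard deviation from both packages can be checked against the square root of the sample variance. The series heights is again a hypothetical five-row sample from the table.

```python
import math
import statistics as st

import pandas as pd

# Heights from the first five rows of the table (illustrative sample only).
heights = pd.Series([70, 74, 70, 72, 72], name="height")

std_st = st.stdev(heights.tolist())  # statistics package
std_pd = heights.std()               # pandas series method (ddof=1 by default)

# Both should equal the square root of the sample variance.
print(std_st, std_pd, math.sqrt(heights.var()))
```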
For population standard deviation, we can use:
The statistics (st) package's function pstdev(pandas series)
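A minimal sketch of the population standard deviation follows. The slides list only pstdev; the pandas call std(ddof=0) is added here as an assumption about how to get the same quantity from a series, since ddof=0 makes pandas divide by n instead of n - 1.

```python
import statistics as st

import pandas as pd

# Heights from the first five rows of the table (illustrative sample only).
heights = pd.Series([70, 74, 70, 72, 72], name="height")

pstd_st = st.pstdev(heights.tolist())  # population standard deviation
pstd_pd = heights.std(ddof=0)          # pandas equivalent: divide by n

print(pstd_st, pstd_pd)
```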
For quantiles, we can use:
The statistics (st) package's function quantiles(pandas series, n=the number of partitions)
The pandas series's function quantile(quantiles you want to get, given as fractions between 0 and 1)
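The sketch below computes quartiles both ways on a hypothetical five-row sample from the table. One point worth knowing: quantiles() in statistics uses method="exclusive" by default, while pandas interpolates linearly between data points, which corresponds to method="inclusive". On small samples the two defaults can therefore give different cut points.

```python
import statistics as st

import pandas as pd

# Heights from the first five rows of the table (illustrative sample only).
heights = pd.Series([70, 74, 70, 72, 72], name="height")
values = heights.tolist()

# n=4 partitions gives the three quartile cut points.
quartiles_exclusive = st.quantiles(values, n=4)                      # default method
quartiles_inclusive = st.quantiles(values, n=4, method="inclusive")

# pandas expects fractions between 0 and 1, not percentages.
quartiles_pd = heights.quantile([0.25, 0.5, 0.75])

print(quartiles_exclusive)
print(quartiles_inclusive)
print(quartiles_pd.tolist())
```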
describe()
You can get all the descriptive statistics for a pandas table with describe()
| | height | weight | bavg | HR |
---|---|---|---|---|
count | 305.000000 | 305.000000 | 305.00000 | 305.000000 |
mean | 72.806557 | 187.449180 | 0.26142 | 139.426230 |
std | 1.795084 | 15.439766 | 0.01889 | 91.206363 |
min | 67.000000 | 150.000000 | 0.21200 | 50.000000 |
25% | 72.000000 | 175.000000 | 0.24800 | 76.000000 |
50% | 73.000000 | 190.000000 | 0.26000 | 109.000000 |
75% | 74.000000 | 195.000000 | 0.27400 | 173.000000 |
max | 78.000000 | 230.000000 | 0.32800 | 563.000000 |
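A summary table like the one above can be produced with a call along these lines. The sketch uses only the first five rows of the player table as a hypothetical stand-in, so its numbers differ from the full 305-row summary.

```python
import pandas as pd

# Hypothetical stand-in: first five rows of the player table.
players = pd.DataFrame({
    "name": ["Jose Cardenal", "Darrell Evans", "Buck Martinez",
             "John Wockenfuss", "Tommy McCraw"],
    "height": [70, 74, 70, 72, 72],
    "weight": [150, 200, 190, 190, 183],
    "bavg": [0.275, 0.248, 0.225, 0.262, 0.246],
    "HR": [138, 414, 58, 86, 75],
})

# describe() summarizes the numeric columns; the text column is left out.
summary = players.describe()
print(summary)
```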
A pandas series provides the function corr() to calculate the correlation coefficient between two series.
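As a sketch, the correlation between height and weight can be computed as below. The data frame players is a hypothetical five-row sample from the table; corr() returns the Pearson correlation coefficient by default.

```python
import pandas as pd

# Hypothetical stand-in: first five rows of the height and weight columns.
players = pd.DataFrame({
    "height": [70, 74, 70, 72, 72],
    "weight": [150, 200, 190, 190, 183],
})

# Pearson correlation between the two series (the default method).
r = players["height"].corr(players["weight"])
print(round(r, 3))
```

The result is symmetric: swapping the two series gives the same coefficient.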
We will use the following Python packages for visualization.
seaborn
matplotlib
histplot(data=your data frame, x=x axis) from seaborn
barplot(data=your data, x=x axis, y=y axis) from seaborn
boxplot(data=your data, x=x axis, y=y axis) from seaborn
lineplot(data=your data frame, x=time column, y=data column) from seaborn
| | year | month | passengers |
---|---|---|---|
0 | 1949 | January | 112 |
1 | 1949 | February | 118 |
2 | 1949 | March | 132 |
3 | 1949 | April | 129 |
4 | 1949 | May | 121 |
scatterplot(data=your data, x=x axis, y=y axis) from seaborn