3 Python Visualization Libraries You MUST Know as A Data Scientist

In real life, data preprocessing is a real pain for most data scientists. But with the help of data visualization libraries, it can actually be fun to play with.

Visual by Gary Killian, via VectorStock.

I, for example, frequently deal with data from the medical industry, and I usually face datasets that are sparse, mislabeled, and legacy (oops). A good visualization technique can quickly give you deep insight into the data, sometimes even before training, or help you present the explanation of an existing model more clearly.

As the title suggests, we are gonna quickly CODE through three of the most exciting visualization libraries in Python, to help you see through the disguise of your poor data.

Overview:

  • Matplotlib: Low-level detailed dev tool, freedom of coding
  • Seaborn: “statistical data visualization”, based on Matplotlib
  • Missingno: a small handy lightweight tool for NaN representation

Env Setup:

We are taking advantage of the power of Pandas and NumPy. If you do not have those tools at hand, simply click the links above and follow the instructions. I personally recommend installing both dependencies with Anaconda: type the following commands in Anaconda Prompt and you are done!

conda install -c anaconda pandas
conda install -c anaconda numpy

Then, set up our three visualization libraries:

Matplotlib:

conda install -c conda-forge matplotlib 

Seaborn:

conda install -c anaconda seaborn

Missingno:

conda install -c conda-forge missingno

Now, you are all set!
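
If you want to double-check the setup before moving on, a minimal sketch like the one below looks each package up without importing it. The helper name `missing_packages` is my own invention for illustration and is not part of any library.

```python
import importlib.util

# Packages this walkthrough relies on
required = ["numpy", "pandas", "matplotlib", "seaborn", "missingno"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

Anything listed as missing can be installed with the matching conda command above.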

Data Setup:

I am using data from a Kaggle open dataset, set up in Colab. I won’t bore you with the details of the API calls and setup; if you need help, you can DM me and I can walk you through it! These docs may help if you really want to work in Colab, and I am sharing my workspace here as well.

# !pip install kaggle
%mkdir -p /content/playground/data/
%cd /content/playground/data/
!kaggle competitions download -c home-data-for-ml-course
!pip install quilt
!quilt install ResidentMario/missingno_data

This script is meant for Colab: it makes a workspace, fetches the data from Kaggle, then installs quilt and uses it to fetch the missingno sample data. The data we are using comes from Kaggle's “Housing Prices Competition for Kaggle Learn Users”, which is a well-documented open dataset.

import numpy as np
import pandas as pd
import missingno as msno
import seaborn as sns
import matplotlib.pyplot as plt

# Load the training split downloaded from Kaggle and peek at the first rows
train = pd.read_csv("train.csv")
train.head()

Here, we import all dependencies, read the data, and glimpse at the first few rows of the dataset:

FIELD1 Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NaN Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal 250000

Then, we use only a subset of the whole dataset for the purpose of demonstration:

col_sel = ["Alley", "Street", "LotArea", "LotFrontage", "Utilities", "LandSlope", "Neighborhood", "YearBuilt", "YearRemodAdd"]
subTrain = train[col_sel]
subTrain.head()
   Alley  Street  LotArea  LotFrontage  Utilities  LandSlope  Neighborhood  YearBuilt  YearRemodAdd
0    NaN    Pave     8450         65.0     AllPub        Gtl       CollgCr       2003          2003
1    NaN    Pave     9600         80.0     AllPub        Gtl       Veenker       1976          1976
2    NaN    Pave    11250         68.0     AllPub        Gtl       CollgCr       2001          2002
3    NaN    Pave     9550         60.0     AllPub        Gtl       Crawfor       1915          1970
4    NaN    Pave    14260         84.0     AllPub        Gtl       NoRidge       2000          2000

Seaborn & Matplotlib

As we mentioned above, Matplotlib is a low-level plotting tool that supports virtually every kind of plot. We usually combine the power of Seaborn and Matplotlib to get a better representation. Some frequently used plots in Seaborn:

  1. Heatmap

We normally use a heatmap for the correlation plot between features, but it can also show the relationship between any three features, as a flat alternative to a 3D contour plot (at least one of them has to be a numeric field).

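f, ax = plt.subplots(figsize=(8, 8))
# Select numeric columns explicitly: recent pandas versions raise an error
# when corr() is called on a frame that still contains string columns
sns.heatmap(subTrain.select_dtypes(include="number").corr(), annot=True, fmt=".2f", ax=ax)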
heatmap of correlation between columns

Only numeric columns (not string columns) get a correlation calculated. We see a high correlation between the variables YearBuilt and YearRemodAdd.
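
To make the "numeric columns only" point concrete, here is a minimal sketch on a toy frame. The values are randomly generated for illustration; only the column names mirror the housing data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset
rng = np.random.default_rng(0)
year_built = rng.integers(1900, 2010, size=50)
toy = pd.DataFrame({
    "Street": ["Pave"] * 50,  # string column, excluded from the correlation
    "YearBuilt": year_built,
    "YearRemodAdd": year_built + rng.integers(0, 5, size=50),
    "LotArea": rng.integers(5000, 15000, size=50),
})

# Keep numeric columns explicitly, then compute pairwise Pearson correlation
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))
```

The resulting square table is exactly what sns.heatmap colors in, one cell per pair of columns.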

  2. Pair Plot

pairplot is also a very handy tool for summarizing data. You not only see the pairwise relation between features, but also the univariate distribution of each feature (a histogram or density curve) on the diagonal.

g = sns.pairplot(subTrain)
g2 = sns.pairplot(subTrain, hue="LandSlope")
pairplot for subset data
pairplot for data varied with feature “LandSlope”
  3. Violin Plot

violinplot is generally used for visualizing and comparing distributions. When the data is grouped by another feature that has only two unique values, we use split=True to draw the two groups as the two halves of each violin.

scale = 1.5
plt.figure(figsize=(len(col_sel)*scale, len(col_sel)*scale))
# Note: seaborn 0.13+ renames the scale= argument to density_norm=
sns.violinplot(x="LandSlope", y="LotArea", hue="Street", split=True, palette="Set2", data=subTrain, inner="stick", scale="count")
plt.xticks(rotation=90);

violinplot amongst ‘LandSlope’, ‘LotArea’ and ‘Street’

As shown in the graph, each violin shows the distribution of LotArea for one LandSlope category, split by Street, which has only two types of entry: Pave and Grvl. The palette argument selects the color theme for this plot, and inner="stick" draws one line for each dataset entry.


missingno & Matplotlib

Missingno is a small but fancy library for data visualization. It has regular charts such as heatmaps and bar charts, like Seaborn does, but also some unique charts such as the nullity matrix and the dendrogram.

Medical science data frequently contains a lot of null values, and missingno makes the missing parts of a dataset very straightforward to see. I cannot recommend it enough!

%matplotlib inline
msno.matrix(subTrain, figsize=(15, 15), color=(0.24, 0.77, 0.77))
Matrix in missingno

missingno represents the data with horizontal bars; the absence of a bar shows a null value. We see that the feature Alley is very sparse, while the feature LotFrontage has a fair amount of useful data. Therefore, for data preprocessing, we can decide to drop Alley and impute LotFrontage.
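
The drop-or-impute decision can also be backed with numbers. The sketch below uses a toy frame with made-up values; the 0.5 cutoff and the median imputation are my own assumptions for illustration, not a rule from missingno.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the subset used here
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Grvl", np.nan],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

# Fraction of missing values per column
missing_ratio = toy.isna().mean()
print(missing_ratio)

# Hypothetical cutoff: drop columns that are more than half empty
threshold = 0.5
to_drop = missing_ratio[missing_ratio > threshold].index
cleaned = toy.drop(columns=to_drop)

# Impute the remaining gaps in LotFrontage with the column median
cleaned["LotFrontage"] = cleaned["LotFrontage"].fillna(cleaned["LotFrontage"].median())
```

On the real data you would run the same steps on subTrain and pick a cutoff that suits your problem.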


Next Step and Beyond:

Those three libraries are very basic but extremely powerful in the Data Science area, and they are enough for most data visualization jobs in the industry. Some next-level tools, such as Plotly or Bokeh, give you the power to build interactive plots and serve them in Web applications. I will get into that area when I learn more about Python Backend techniques and Frontend engineering 👊