According to an article by Yufeng G of Google Cloud, a machine learning project consists of 7 steps: Data Collection, Data Preparation, Choosing a Model, Training the Model, Evaluating the Model, Hyperparameter Tuning and Prediction.
Today we will discuss data visualization, which is one of the key practices under Step 2, Data Preparation. It helps you analyse your data better as it gets larger and more complex.
Matplotlib has long been the go-to library for data visualization. However, it can be frustrating to write several lines of code just to get an attractive graph. This is where Seaborn comes in.
Introduction to Seaborn
Seaborn is a popular Python library for creating well-designed data visualizations. It is built on top of Matplotlib and, with fewer lines of code, generates charts that are visually appealing.
“If matplotlib ‘tries to make easy things easy and hard things possible’, seaborn tries to make a well-defined set of hard things easy too.”
Important Features of Seaborn:
- Works directly with pandas dataframes, which makes them easier to visualize than with Matplotlib alone
- Built-in themes and a choice of colour palettes
- Fitting and visualizing linear regression models
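As a small illustration of the themes and palettes mentioned above, here is a minimal sketch. The theme name "whitegrid" and the palette name "YlGnBu" are just two of the built-in options, picked here as examples.

```python
import seaborn as sns

# Switch all later plots to the built-in "whitegrid" theme
sns.set_style("whitegrid")

# Ask for five colours from the "YlGnBu" palette;
# each colour comes back as an (r, g, b) tuple
palette = sns.color_palette("YlGnBu", 5)
print(len(palette))
print(palette[0])
```

The returned palette can be passed to most seaborn plotting functions through their palette parameter.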
Installing Seaborn
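Seaborn has the following dependencies, so make sure you have them installed beforehand. These are the minimum versions at the time of writing; newer Seaborn releases require newer versions: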
- Python 3.6+
- NumPy (>= 1.13.3)
- SciPy (>= 1.0.1)
- pandas (>= 0.22.0)
- matplotlib (>= 2.1.2)
To install with pip:
pip install seaborn
If you are using conda (Anaconda Distribution):
conda install seaborn
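After installing, it can help to check which versions you actually got, since some function signatures differ between Seaborn releases. A minimal sketch:

```python
import matplotlib
import seaborn as sns

# Print the installed versions so they can be compared with the requirements
print("seaborn", sns.__version__)
print("matplotlib", matplotlib.__version__)
```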
I will be using Google Colab, which provides a Jupyter Notebook environment in the cloud. It is a free service and also offers GPUs for faster processing of deep learning models.
Importing libraries and dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
datasets = sns.get_dataset_names()
Let’s Visualize Using Seaborn!
Based on the goal of the visualization, we will classify the plots into 3 categories:
- Trends: Visualizing patterns of change of a variable over time (Line Plots)
- Relationships: Visualizing relationships between variables (Bar Charts, Scatterplot, Heatmap, Swarmplot, Regplot)
- Distribution: Visualizing the distribution of values of a variable (Histograms, KDE Plots, Joint Plots, Boxplot)
Visualizing Trends
Line charts are best for showing patterns over time. To plot one, we will use the flights example dataset, which records the number of airline passengers for each month of each year in the columns year, month and passengers:

# Loading the dataset
df = sns.load_dataset("flights")
# Converting the year column to float
df.year = df.year.astype(float)
# Setting the size of the chart
plt.figure(figsize=(16, 5))
# Showing the trend of passengers over months
sns.lineplot(x="month", y="passengers", data=df)
plt.show()
We can see that the passenger count starts to rise during the months of May to August, likely because these months fall within the vacation period of many schools and colleges.

Visualizing Relationships
1) Bar Charts: Best for comparing values belonging to different items
Using the same data, we can compare the passengers value across the different months.
plt.figure(figsize=(14, 5))
sns.barplot(x="month", y="passengers", data=df)
plt.show()

2) Heatmap: Helps you find patterns in data with the help of colour-coded schemes
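pivot is used to reshape the dataframe with index='month', columns='year' and values='passengers'. The heatmap created below maps the passenger count to each month of each year.
# Keyword arguments are used because recent pandas versions no longer accept positional ones here
# The result gets its own name so that df keeps its original format for the later plots
flights_pivot = df.pivot(index='month', columns='year', values='passengers')
plt.figure(figsize=(8, 5))
# annot displays values, fmt='d' is integer format
sns.heatmap(flights_pivot, annot=True, fmt='d')

You can also change the colormap as follows:
sns.heatmap(flights_pivot, annot=True, fmt='d', cmap='YlGnBu')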
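To see what the pivot step does before the heatmap is drawn, here is a minimal sketch on a hypothetical four-row frame in the same long format (the frame and its values are typed in by hand for illustration).

```python
import pandas as pd

# Hypothetical toy frame: one row per (year, month) pair
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# One row per month, one column per year, passenger counts in the cells
wide_df = long_df.pivot(index="month", columns="year", values="passengers")
print(wide_df)
```

The heatmap then colours one cell per (month, year) pair of this wide table.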

3) ScatterPlot: Shows the relationship between two variables. If the plot is colour-coded, it also shows the relationship with a third variable.
sns.scatterplot( x="passengers", y="month", data=df)

Now we will add a third variable to the scatterplot as a colour code using the hue parameter.
# Adding a third variable as hue
sns.scatterplot(x="passengers", y="month", hue="year", data=df)

We can also add custom color palettes of our choice:
# Adding a colour palette
sns.scatterplot(x="passengers", y="month", hue="year", palette='inferno_r', data=df)

4) Regplot and Lmplot: Regplot adds a regression line over the scatterplot to check for a linear relationship between variables. Lmplot adds multiple regression lines when the scatterplot has multiple groups.
Here we will be using the tips dataset, which contains information about a restaurant with the features shown below. The goal is to analyse how the tip given by customers changes with the other features.
tips = sns.load_dataset('tips')
tips.head()

Implementing regplot on tip and total_bill
sns.regplot( x='total_bill', y='tip', data=tips)
The plot shows that the tip is positively correlated with the amount of bill.
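The line that regplot draws is, by default, an ordinary least-squares fit. The sketch below computes such a fit and the Pearson correlation with NumPy on hypothetical bills and tips (values invented for illustration, not taken from the tips dataset).

```python
import numpy as np

# Hypothetical bills and tips, invented for illustration only
total_bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = np.array([1.5, 3.0, 4.0, 6.5, 7.0])

# A degree-1 polynomial fit is an ordinary least-squares straight line
slope, intercept = np.polyfit(total_bill, tip, 1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(total_bill, tip)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

A positive slope together with a positive correlation coefficient is what "positively correlated" means here.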
Implementing lmplot by adding third category smoker as hue
sns.lmplot( x="total_bill", y="tip", hue="smoker", data=tips)
The plot suggests that customers who don’t smoke tend to tip more than those who smoke.
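One practical difference from regplot: lmplot creates its own figure and returns a FacetGrid, so the chart size is set with its height and aspect parameters rather than with plt.figure. A minimal sketch on a hypothetical toy frame (values invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # only needed when running as a plain script; skip this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical toy data with two groups
toy = pd.DataFrame({
    "total_bill": [10, 20, 30, 40, 12, 22, 32, 42],
    "tip": [1.0, 2.5, 3.5, 5.0, 1.5, 2.0, 3.0, 3.5],
    "smoker": ["No"] * 4 + ["Yes"] * 4,
})

# height is in inches, aspect is the width divided by the height
g = sns.lmplot(x="total_bill", y="tip", hue="smoker", data=toy,
               height=4, aspect=1.5)
print(type(g).__name__)

# The underlying matplotlib axes is still reachable for further tweaks
ax = g.axes.flat[0]
ax.set_title("Tip vs. total bill by smoker")
```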

5) Swarmplot: It is used to map the relationship between a categorical variable and a continuous variable.
sns.swarmplot( x="day", y="tip", data=tips)

Now we will add a third variable, time, which shows us which time of the day receives more tips.
sns.swarmplot( x="day", y="tip", hue='time', data=tips)

This shows that the restaurant receives most of its tips during lunch on weekdays and during dinner on weekends. We can replace time with the smoker category to analyse who tips more on each day.
# dodge is used to plot each category separately
sns.swarmplot(x="day", y="tip", hue='smoker', dodge=True, data=tips)

Next, we will extend this analysis with a variation of the swarm plot, the violin plot, which shows the same data as a smoothed density outline instead of individual points. We will plot both of them together for comparison.
sns.violinplot(x="day", y="tip", data=tips, inner=None)
sns.swarmplot(x="day", y="tip", data=tips, color='white')

Visualizing Distributions
1) Histograms: Help you analyse the distribution of a single numerical variable
# distplot is deprecated since seaborn 0.11; histplot with kde=True draws a similar chart
sns.histplot(tips['tip'], kde=True, color='red')
Here we see that most of the tips fall between $2 and $4.
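Underneath, a histogram is just a count of how many values fall into each bin. A minimal sketch with NumPy on hypothetical tip amounts (the values and bin edges are invented for illustration):

```python
import numpy as np

# Hypothetical tip amounts in dollars
tips_toy = np.array([1.0, 2.5, 3.0, 3.5, 5.0])

# Bins: $0 to $2, $2 to $4 and $4 to $6
bin_edges = [0, 2, 4, 6]
counts, edges = np.histogram(tips_toy, bins=bin_edges)
print(counts)
```

The tallest bar in the plot corresponds to the bin with the largest count.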

2) Jointplot: It is used to analyse the joint distribution of two numeric variables.
sns.jointplot(x=tips.total_bill, y=tips.tip, kind='kde')
Here we see that the density is highest for tips of $2 to $4 on bills of $10 to $15.

3) KDEplot: It is a smoothed version of a histogram, for either a single numeric variable or two variables.
Here we will use a new dataset called iris, which provides measurements of features for various species of the iris plant.
iris = sns.load_dataset("iris")
iris.head()

# Obtaining plants from the dataset
setosa = iris.loc[iris.species == "setosa"]
virginica = iris.loc[iris.species == "virginica"]
# Plotting KDE (recent seaborn versions expect the x= and y= keywords)
sns.kdeplot(x=setosa.sepal_length, y=setosa.sepal_width, label='setosa')
sns.kdeplot(x=virginica.sepal_length, y=virginica.sepal_width, label='virginica')
# labels are used to plot the legend
plt.legend()
Here we see the distribution of two iris species based on their sepal_width and sepal_length.
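For intuition about where the smooth curves come from, a one-dimensional KDE can be sketched by hand: place one Gaussian bump on every sample and average them. The helper gaussian_kde_1d, the sample values and the fixed bandwidth of 0.3 are all assumptions made for this illustration; seaborn chooses the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Shape (len(grid), len(samples)): scaled distance from each grid point to each sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical sepal lengths in cm, invented for illustration
sepal_lengths = [4.9, 5.0, 5.1, 5.4, 5.8]
grid = np.linspace(3.0, 8.0, 501)
density = gaussian_kde_1d(sepal_lengths, grid, bandwidth=0.3)

# A density should integrate to roughly 1 over the grid
area = density.sum() * (grid[1] - grid[0])
peak = grid[density.argmax()]
print(round(area, 3), round(peak, 2))
```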

4) Boxplot: It is used to summarize a variable and compare it across groups. From it we can read the three quartiles as well as the minimum and maximum values.
# FacetGrid is used to plot multiple charts
g = sns.FacetGrid(iris, col="species")
g.map(sns.boxplot, "sepal_length")
Here we can see the distribution of sepal_length for the three iris species.
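The numbers a boxplot draws can be computed directly. The sketch below uses NumPy on hypothetical values (invented for illustration) and also applies the default whisker rule, under which points further than 1.5 times the interquartile range from the box are drawn as separate outliers.

```python
import numpy as np

# Hypothetical measurements, invented for illustration
values = np.array([4.5, 5.0, 5.5, 6.0, 9.0])

# The three quartiles form the box and its middle line
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Default whisker limits: 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers)
```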

Conclusion
Today we have learnt various plots for visualizing trends, relationships and distributions using Seaborn. We also learnt how changing the colour palette of a plot brings out its aesthetic value. If you want to explore Seaborn more extensively, I would recommend looking at its documentation.
Here is the link to my Colab Notebook where you can find the whole implementation. If you liked this tutorial and would like to read more content like this, do comment and share this post. Happy Learning!



