Data Visualization: How to Choose the Right Plots with Plot Interpretation

Understanding Data Types and Charts Interpretation

Abdulwasiu Tiamiyu
DataDrivenInvestor

Data visualization is the cornerstone of effective data communication. The choice of plots can make or break the clarity and impact of your insights. Join me on a journey to demystify this crucial skill and elevate your data storytelling game.

Photo by Myriam Jessier on Unsplash

As someone passionate about working with data, I have often envisioned captivating visualizations only to struggle to bring them to life, especially in my early days. This frustration echoed in messages from learners facing the same challenge during my mentoring journey.

These experiences inspired me to write a guide for fellow data enthusiasts, particularly those starting their data careers. This post aims to demystify data visualization, helping you find the perfect plot for any data type. By the end, you’ll feel more confident in creating visually engaging charts that make an impact.

Table of Contents

· Understanding Data Types
· Dataset
· Choosing the Right Plots
· 1. Univariate Plots
a. Numerical Variable:
b. Categorical Variable:
· 2. Bivariate Plots
a. Numerical vs Numerical variables:
b. Numerical vs Categorical variables:
c. Categorical vs Categorical variables:
· 3. Multivariate Plots
a. Two numerical and one categorical variables — scatterplot
b. One numerical and two categorical variables — barplot
c(i). One numerical and two categorical variables — point plot
c(ii). One numerical and two categorical variables — boxplot

Understanding Data Types

Data types form the foundation of any data analysis process, playing a crucial role in data understanding and manipulation. In the realm of data science, data can be classified into various types: numerical, categorical, textual, and time-series.

  • Numerical data: Also referred to as quantitative data, this covers values used for measurements and calculations. Numerical data is further divided into discrete and continuous data types.
    - Discrete data consists of distinct, countable values, often whole numbers, with no intermediate values between two adjacent data points. Examples include the number of students in a class, the count of cars in a parking lot, or the number of items sold in a store.
    - Continuous data represents measurements that can take any value within a specific range. Examples include height, weight, and temperature readings.
  • Categorical data: Also known as qualitative data, this represents distinct groups or categories. Categorical data is further divided into nominal and ordinal data types.
    - Nominal data is categorical data whose values represent distinct categories with no inherent order or ranking. Examples include colours, gender (male/female), blood group, or types of animals (e.g., cat, dog, bird).
    - Ordinal data is also categorical, but its values have a natural order or ranking. However, the differences between the ranks are not necessarily uniform or quantifiable. Examples include position in a class (1st, 2nd, 3rd) or survey responses with options like “strongly disagree,” “disagree,” “neutral,” “agree,” and “strongly agree.”
  • Textual data: Encompasses unstructured text, requiring specialized techniques like Natural Language Processing (NLP) for analysis.
  • Time-series data: Involves temporal information, crucial for studying trends and patterns over time.
Photo by piscine26 on Freepik

For this guide, our focus will be on visualizing numerical and categorical data types using Python libraries.

Dataset

The dataset used in this article is publicly available insurance data sourced from Kaggle. It comprises 1338 observations, each representing an individual, and 7 features. Four of these features are numerical: age, Body Mass Index (BMI), number of children, and insurance charges. The remaining three are nominal categorical features: sex, smoking status, and region.
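Before plotting, it helps to label each column with its data type. The sketch below is a rough heuristic rather than a rule: the `classify_column` helper and the three sample rows are made up for illustration, and only the column names follow the dataset described above. Note that it cannot tell nominal from ordinal categories; that still takes domain knowledge.

```python
# Hypothetical rows shaped like the insurance data; the values are made up.
rows = [
    {"age": 19, "bmi": 27.9, "children": 0, "smoker": "yes", "region": "southwest"},
    {"age": 33, "bmi": 22.7, "children": 1, "smoker": "no", "region": "northwest"},
    {"age": 46, "bmi": 33.4, "children": 3, "smoker": "no", "region": "southeast"},
]

def classify_column(values):
    """Rough heuristic: ints -> discrete, floats -> continuous, else categorical."""
    if any(isinstance(v, bool) for v in values):
        return "categorical"
    if all(isinstance(v, int) for v in values):
        return "numerical (discrete)"
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical (continuous)"
    return "categorical"

# Classify every column by looking at the values it holds.
column_types = {
    name: classify_column([row[name] for row in rows])
    for name in rows[0]
}
for name, kind in column_types.items():
    print(f"{name}: {kind}")
```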

# Import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
# Read the dataset
df = pd.read_csv('https://raw.githubusercontent.com/Tiamiyu1/Health-Insurance-Analysis/main/insurance.csv')
df.head()

Choosing the Right Plots

Effective data visualization begins with selecting appropriate plots for the data at hand. After identifying the data type for each variable, the next step is to match it with a suitable plot.
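The matching step can be sketched as a simple lookup. The pairings below are the ones covered in the sections of this guide; the `PLOT_GUIDE` dictionary and the `suggest_plots` function are hypothetical helpers written for illustration, not part of seaborn or any other library.

```python
# Pairings taken from the sections of this guide; the names are hypothetical.
PLOT_GUIDE = {
    ("numerical",): ["histogram", "boxplot", "count plot (discrete values)"],
    ("categorical",): ["count plot", "pie chart"],
    ("numerical", "numerical"): ["scatter plot"],
    ("categorical", "numerical"): ["boxplot", "bar plot"],
    ("categorical", "categorical"): ["count plot with hue"],
    ("categorical", "numerical", "numerical"): ["scatter plot with hue"],
    ("categorical", "categorical", "numerical"): [
        "bar plot with hue", "point plot with hue", "boxplot with hue",
    ],
}

def suggest_plots(*kinds):
    """Return candidate plots for the given variable kinds, in any order."""
    key = tuple(sorted(kinds))  # sort so the argument order does not matter
    return PLOT_GUIDE.get(key, [])

print(suggest_plots("numerical", "categorical"))
print(suggest_plots("numerical", "numerical", "categorical"))
```

An empty list simply means the combination is not covered here, not that no suitable plot exists.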

1. Univariate Plots

These plots are used when analyzing one variable at a time.

a. Numerical Variable:

  • histogram: Represents the distribution of a variable by counting the number of observations that fall within discrete bins. A similar plot is the KDE plot, which shows a smoothed version of the distribution.
sns.histplot(df.bmi)
plt.title('BMI Histogram Distribution')
BMI Histogram Distribution

Observation: The distribution of BMI values is roughly bell-shaped, which suggests it is approximately normal. Most individuals in the dataset have BMI values clustered around the center (the mean), with fewer instances of extreme values.
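To make the idea of "counting observations within bins" concrete, here is a minimal sketch in plain Python. The BMI readings and the bin width of 5 are assumptions chosen for illustration; seaborn picks its own bin width by default.

```python
from collections import Counter

# Hypothetical BMI readings, made up for illustration.
bmi_values = [18.5, 22.1, 24.9, 27.3, 28.0, 29.6, 30.2, 31.8, 33.5, 41.0]
bin_width = 5  # assumed bin width, not seaborn's default

# Map each value to the left edge of the bin it falls into, then count.
bin_counts = Counter(int(v // bin_width) * bin_width for v in bmi_values)

# Print a text histogram: one row per non-empty bin.
for left_edge in sorted(bin_counts):
    count = bin_counts[left_edge]
    print(f"[{left_edge}, {left_edge + bin_width}): {'#' * count} ({count})")
```

The bar heights in a histogram are exactly these per-bin counts, so a different bin width can change the apparent shape of the distribution.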

  • count plot: Shows the counts of observations in each categorical bin using bars
sns.countplot(x=df.children)
plt.title('No of Children Count')
No of Children Count

Observation: The count plot reveals a distinct distribution of individuals in the insurance dataset based on the number of children. The majority of individuals have no children, forming the largest group. Following this, the frequency gradually decreases as the number of children increases.

  • boxplot: A graphical representation showing the distribution of data, including the median, quartiles, and outliers using a box-and-whisker format. Other plots in this category are the swarm plot and violin plot.
sns.boxplot(x=df.charges)
plt.title('Charges Boxplot')
Charges Boxplot

Observation: The boxplot shows a median of just under 10,000, which marks the center of the distribution. The interquartile range (IQR), represented by the box between the 1st and 3rd quartiles, spans from around 5,000 to 17,000, so the middle 50% of the data falls within this range. The lower whisker reaches down to roughly 1,000.
Additionally, there are values beyond the upper whisker, suggesting potential outliers in the upper range of the dataset. Although many of these values sit fairly close to the upper whisker, their existence warrants further investigation.
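The numbers a boxplot draws can be computed by hand. The sketch below uses made-up charges and the common 1.5 × IQR rule, which is also the default whisker length in seaborn and matplotlib; the variable names are my own.

```python
import statistics

# Hypothetical charges, made up for illustration.
charges = [1100, 2200, 4700, 6000, 9400, 12000, 16600, 21000, 48000]

# Quartiles, using linear interpolation between data points.
q1, median, q3 = statistics.quantiles(charges, n=4, method="inclusive")
iqr = q3 - q1  # height of the box

# Points further than 1.5 * IQR from the box are drawn individually.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [c for c in charges if c < lower_fence or c > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

The whiskers stop at the most extreme data points that still lie inside the fences, which is why a point beyond the upper fence shows up as a separate dot.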

Note: Boxplot and histogram plots are more suitable for continuous numerical variables (e.g., charges and BMI), while the count plot is suitable for discrete numerical variables (e.g., No of Children).

b. Categorical Variable:

- Countplot: Shows the counts of observations in each category using bars.

sns.countplot(x=df.region)
plt.title('Region Count')
Region Count

Observation: The count plot illustrates a notable disparity in the frequency of individuals across different regions. The southeast region stands out with the highest frequency, suggesting a concentration of data points in that specific area. In contrast, the frequencies in the other three regions are closely clustered, indicating a relatively balanced distribution among these regions.

  • Piechart: A circular graph visually representing the proportional distribution of different categories or components of a data set feature/variable, often in the form of percentages.
sorted_counts = df['region'].value_counts()

plt.pie(sorted_counts, labels=sorted_counts.index, autopct='%.0f%%')
plt.title('Region Pie Chart')
Region Pie Chart

Observation: The pie chart serves as an alternative representation of the regional distribution observed in the count plot above. Each segment of the pie corresponds to a specific region, and the size of each slice is proportional to the percentage of individuals from that region in the dataset.
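The link between the count plot and the pie chart is a single division: each slice is a category's count over the total. A minimal sketch with made-up region labels (not the real dataset) looks like this:

```python
from collections import Counter

# Hypothetical region labels, made up for illustration.
regions = ["southeast"] * 4 + ["southwest"] * 3 + ["northwest"] * 2 + ["northeast"]

# The count plot shows these counts directly.
region_counts = Counter(regions)
total = sum(region_counts.values())

# The pie chart shows each count as a percentage of the total.
region_shares = {
    region: 100 * count / total for region, count in region_counts.items()
}
for region, share in region_shares.items():
    print(f"{region}: {share:.0f}%")
```

The `'%.0f%%'` format passed to `autopct` in the pie chart code rounds the shares to whole percentages in the same way.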

2. Bivariate Plots

These plots are used when analyzing two variables at a time, with three subcategories.

a. Numerical vs Numerical variables:

- Scatter plot: Displays individual data points as dots on a two-dimensional plane, allowing for the visualization of the relationship or correlation between two variables. Similar plots include line plots, regression plots, and colour-coded heatmaps.

sns.scatterplot(x=df.bmi, y=df.charges)
plt.title('Scatter Plot of Charges and BMI')
Scatter Plot of Charges and BMI

Observation: The scatter plot shows the relationship between BMI and charges. A general trend of positive correlation is apparent, suggesting that, on average, as BMI increases, charges tend to increase as well. However, the points are widely scattered, so the relationship is far from exact: individual charges can differ greatly at the same BMI.
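A scatter plot gives a visual impression of correlation; the Pearson coefficient puts a number on it. The sketch below computes it by hand on made-up (BMI, charges) pairs, and `pearson_r` is a helper name I chose for illustration.

```python
from math import sqrt

# Hypothetical (BMI, charges) pairs, made up for illustration.
bmi = [20.0, 24.0, 28.0, 32.0, 36.0]
charges = [3000.0, 9000.0, 5000.0, 14000.0, 9000.0]

def pearson_r(xs, ys):
    """Pearson correlation: co-variation scaled by the spread of each variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson_r(bmi, charges)
print(f"r = {r:.2f}")
```

A value near 1 means the points fall close to a rising line, a value near 0 means there is little linear relationship, and a value near -1 means they fall close to a falling line.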

b. Numerical vs Categorical variables:

- Boxplot: Shows the distribution of quantitative data, facilitating comparisons between variables or across levels of a categorical variable.

sns.boxplot(x=df.sex, y=df.bmi)
plt.title('Boxplot Showing BMI Distribution by Gender')
Boxplot Showing BMI Distribution by Gender

Observation: The boxplots of BMI for males and females reveal interesting insights. While the median values for both genders are close, suggesting a similar central tendency, notable differences are observed in the spread of the data. The interquartile range (IQR) for males appears wider, indicating greater variability in BMI among males compared to females. The longer upper whisker for males reinforces this, suggesting that the upper range of BMI for males extends further than that of females.
Additionally, both boxplots show a few data points above the upper whisker, indicating potential outliers in both male and female BMI. These outliers are extreme values that deviate from the general trend within each gender.

  • Barplot: A graphical representation using rectangular bars of varying heights or lengths to compare different categorical data or discrete values. The default estimator is the mean but can be changed to any other measure of central tendency.
sns.barplot(x=df.smoker, y=df.charges)
plt.title('Barplot showing the Average Charges by Smoking Status')
Barplot showing the Average Charges by Smoking Status

Observation: Despite the majority of individuals being non-smokers, a striking observation emerges when comparing the average charges. On average, the charges for smokers are more than three times higher than those for non-smokers. This substantial difference underscores the significant impact of smoking status on medical charges within the dataset.
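Under the hood, a bar plot groups the numerical values by category and applies an estimator to each group. The sketch below reproduces that with made-up (smoker, charges) records, using the mean and, as an alternative estimator, the median.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (smoker, charges) records, made up for illustration.
records = [
    ("no", 4000), ("no", 7000), ("no", 10000),
    ("yes", 20000), ("yes", 30000), ("yes", 46000),
]

# Group the charges by smoking status.
charges_by_smoker = defaultdict(list)
for smoker, charge in records:
    charges_by_smoker[smoker].append(charge)

# One bar per group: its height is the estimator applied to that group.
mean_charges = {k: mean(v) for k, v in charges_by_smoker.items()}
median_charges = {k: median(v) for k, v in charges_by_smoker.items()}

print(mean_charges)
print(median_charges)
print(f"ratio of means: {mean_charges['yes'] / mean_charges['no']:.1f}")
```

When a group contains a few very large values, the mean and median bars can differ noticeably, which is a good reason to check both before drawing conclusions.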

c. Categorical vs Categorical variables:

- Countplot (with the hue parameter):

sns.countplot(x=df.sex, hue=df.smoker)
plt.title('Smoker Count by Gender')
Count Plot for Smoker by Gender

Observation: The dataset exhibits a nearly equal distribution between genders, with males slightly outnumbering females. However, a notable disparity is observed in smoking habits, where the number of male smokers surpasses that of females.
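A count plot with `hue` is a cross-tabulation drawn as bars: one bar for every combination of the two categories. A minimal sketch on made-up (sex, smoker) pairs:

```python
from collections import Counter

# Hypothetical (sex, smoker) pairs, made up for illustration.
people = [
    ("male", "yes"), ("male", "no"), ("male", "no"), ("male", "yes"),
    ("female", "no"), ("female", "no"), ("female", "yes"),
]

# Count every (sex, smoker) combination: one bar per combination.
pair_counts = Counter(people)

# Print the counts as a small table, grouped by sex.
for sex in ("female", "male"):
    for smoker in ("no", "yes"):
        print(f"{sex:>6} | smoker={smoker:<3} | {pair_counts[(sex, smoker)]}")
```

Reading the plot group by group (all bars for one gender, then the other) mirrors the rows of this table.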

3. Multivariate Plots

These plots are used when analyzing more than two variables at a time, i.e., three or more variables.

a. Two numerical and one categorical variables — scatterplot

sns.scatterplot(x=df.bmi, y=df.charges, hue=df.smoker)
plt.title('Scatter Plot of Charges and BMI vs Smoker')
Scatter Plot of Charges and BMI vs Smoker

Observation: Building upon our previous analysis of the scatter plot, a distinct trend emerges when considering the interaction between BMI, charges, and smoking status. The color-coded points (one color for smokers, another for non-smokers, as shown in the legend) reveal that the observed increase in charges with rising BMI applies primarily to smokers. For non-smokers, charges remain relatively flat as BMI increases.
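One way to check a pattern like this numerically is to split the points by the hue variable and fit a separate trend line to each group. The sketch below does that with a hand-written least-squares slope on made-up points; the `slope` helper and the data are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical (bmi, charges, smoker) points, made up for illustration.
points = [
    (22.0, 18000.0, "yes"), (28.0, 26000.0, "yes"), (34.0, 40000.0, "yes"),
    (22.0, 6000.0, "no"), (28.0, 7000.0, "no"), (34.0, 6500.0, "no"),
]

def slope(xs, ys):
    """Least-squares slope of y on x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Split the points by the hue variable, then fit each group separately.
groups = defaultdict(lambda: ([], []))
for bmi, charge, smoker in points:
    groups[smoker][0].append(bmi)
    groups[smoker][1].append(charge)

slopes = {smoker: slope(xs, ys) for smoker, (xs, ys) in groups.items()}
for smoker, value in slopes.items():
    print(f"smoker={smoker}: about {value:.0f} extra charges per BMI unit")
```

A much steeper slope in one group than in the other is the numerical counterpart of what the colored scatter plot shows visually.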

b. One numerical and two categorical variables — barplot

sns.barplot(x=df.smoker, y=df.charges, hue=df.sex)
plt.title('Average Charges by Smoking Status and Gender')
Average Charges by Smoking Status and Gender

Observation: On average, both male and female smokers have significantly higher charges than their non-smoking counterparts. Within each smoking group, the genders also differ. Male smokers, on average, have higher charges than female smokers, which hints at a gender-specific effect of smoking on healthcare costs. Conversely, among non-smokers, females have higher average charges than males.

c(i). One numerical and two categorical variables — point plot

sns.pointplot(x=df.region, y=df.bmi, hue=df.sex)
plt.title('Pointplot of Average BMI by Region and Gender')
Point plot of Average BMI by Region and Gender

Observation: In the Southeast and Southwest regions, males exhibit higher average BMI values compared to females. Conversely, in the Northwest and Northeast regions, females have higher average BMI values than males.
Of particular note is the Southeast region, where both male and female average BMI values are notably higher compared to the other regions.
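When comparing group means like these, keep in mind that seaborn's point plots and bar plots also draw an error bar around each mean, which by default is a bootstrapped 95% confidence interval. The sketch below shows the general percentile-bootstrap idea on made-up BMI values; `bootstrap_ci` is a hypothetical helper and is not guaranteed to match seaborn's internals exactly.

```python
import random
from statistics import mean

# Hypothetical BMI values for one region/gender group, made up for illustration.
group_bmi = [26.1, 29.4, 31.0, 33.2, 34.8, 36.5, 30.3, 28.7]

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Resample with replacement many times and record each resample's mean.
    boot_means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    # Cut off the same share of resampled means at each end.
    low_idx = n_boot * (100 - level) // 200
    high_idx = n_boot - low_idx - 1
    return boot_means[low_idx], boot_means[high_idx]

low, high = bootstrap_ci(group_bmi)
print(f"mean={mean(group_bmi):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

If the intervals of two groups overlap heavily, the difference between their means may not be meaningful, so small gaps between points deserve a cautious reading.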

c(ii). One numerical and two categorical variables — boxplot

sns.boxplot(x=df.smoker, y=df.bmi, hue=df.sex)
plt.title('BMI Boxplot Distribution by Gender and Smoking Status');
BMI Boxplot Distribution by Gender and Smoking Status

Observation: Non-smokers, irrespective of gender, exhibit relatively similar median BMI values. However, a notable gender disparity becomes evident among smokers, with male smokers displaying a higher median BMI compared to their female counterparts.

This guide provides a general overview, but there are many more visualizations to explore, including subplots and the Pearson correlation plot for multivariate analysis.

Explore the code for all the charts featured in this article, available in this notebook. Dive in and elevate your data visualization game!

I hope you find this exploration into data visualization intriguing! For those eager to enhance the appeal and presentation of their visualizations, I’ve got more insights for you — 10 Data Visualization Designs You Need to Know About.

I delve into topics like:

  • Plot title and axes label design
  • Label rotation for better readability
  • Strategies to remove uninformative colors
  • Incorporating figures like count values and percentages
  • Pie chart formatting and more…

In conclusion, mastering data visualization is a journey of turning raw data into compelling stories. From understanding data types to choosing the right plots and chart interpretation, this guide aimed to demystify the art of visualization. Now armed with practical tips, may your charts tell impactful stories and inspire your data career journey. Happy charting!

Share this with your network if you find it interesting!

Connect with me on Medium, LinkedIn

Machine learning, AI and Data Science enthusiast with a special interest in Bioinformatics and Molecular Biology. https://www.linkedin.com/in/tiamiyu1