
When given a new dataset, the first step should always be to perform Exploratory Data Analysis (EDA) - taking a broad, high-level look at what the dataset contains. This is done to get to grips with the data and its format, and includes data visualisation to give an overview of various aspects of the data at a glance.

1 Load the Libraries

The code on this page uses the kagglehub, Matplotlib, statsmodels, missingno, pandas and seaborn libraries:

from kagglehub import KaggleDatasetAdapter
from matplotlib import pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
import kagglehub
import missingno as msno
import pandas as pd
import seaborn as sns

If you do not have one or more of these libraries installed, you can install them from the terminal:

$ python3.13 -m pip install missingno
$ python3.13 -m pip install kagglehub
$ python3.13 -m pip install statsmodels
$ python3.13 -m pip install matplotlib
$ python3.13 -m pip install pandas
$ python3.13 -m pip install seaborn

2 Load the Data

This page will use laptops.csv from Kaggle:

# Load the data from
# https://www.kaggle.com/datasets/juanmerinobermejo/laptops-price-dataset
df = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    'juanmerinobermejo/laptops-price-dataset',
    'laptops.csv',
)
df = df.reset_index()
df = df.drop('Laptop', axis=1)
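
If downloading through kagglehub is not an option, the same file can be read from a local copy instead. A sketch ('laptops.csv' is an assumed local path; note that read_csv produces no extra index column, so the reset_index() step above and the drop in the next section are not needed):

# Alternative: read a locally downloaded copy of the same file
# ('laptops.csv' is an assumed local path)
df = pd.read_csv('laptops.csv')
df = df.drop('Laptop', axis=1)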

This dataset contains information about laptops: their features and their prices. Don’t worry if you’ve never seen this dataset before or don’t know anything about laptops; the entire point of this exercise is to take an unknown dataset and explore it!

3 Data Profiling

This step aims to understand the structure of the data:

  • Number of rows and columns
  • Data types and ranges
    • Categorical vs numerical vs datetime variables
  • Summary statistics (mean, median, min, max, percentiles, standard deviation)
  • Unique values and counts
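
A quick way to get the unique-value counts mentioned in the last point is pandas’ nunique():

# Count the number of distinct values in each column
print(df.nunique())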

3.1 Initial Look

Start by having a look at the first 5 rows (this might require changing the pandas display options so that the output doesn’t get truncated):

# Increase the maximum number of columns that get printed
pd.set_option('display.max_columns', 20)
# Increase the width of the printed output
pd.set_option('display.width', 200)
# Display the first 5 rows
print(df.head())
##    index Status   Brand       Model            CPU  RAM  Storage Storage type       GPU  Screen Touch  Final Price
## 0      0    New    Asus  ExpertBook  Intel Core i5    8      512          SSD       NaN    15.6    No      1009.00
## 1      1    New  Alurin          Go  Intel Celeron    8      256          SSD       NaN    15.6    No       299.00
## 2      2    New    Asus  ExpertBook  Intel Core i3    8      256          SSD       NaN    15.6    No       789.00
## 3      3    New     MSI      Katana  Intel Core i7   16     1000          SSD  RTX 3050    15.6    No      1199.00
## 4      4    New      HP         15S  Intel Core i5   16      512          SSD       NaN    15.6    No       669.01

Clearly the index column is just an incrementing ID number, so we can delete it:

df = df.drop('index', axis=1)

Check the number of rows and columns:

print(df.shape)
## (2160, 11)

Check the column names and data types:

print(df.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2160 entries, 0 to 2159
## Data columns (total 11 columns):
##  #   Column        Non-Null Count  Dtype  
## ---  ------        --------------  -----  
##  0   Status        2160 non-null   object 
##  1   Brand         2160 non-null   object 
##  2   Model         2160 non-null   object 
##  3   CPU           2160 non-null   object 
##  4   RAM           2160 non-null   int64  
##  5   Storage       2160 non-null   int64  
##  6   Storage type  2118 non-null   object 
##  7   GPU           789 non-null    object 
##  8   Screen        2156 non-null   float64
##  9   Touch         2160 non-null   object 
##  10  Final Price   2160 non-null   float64
## dtypes: float64(2), int64(2), object(7)
## memory usage: 185.8+ KB
## None

3.2 Missing Data

The above table tells us that most columns - but not all - have 2160 non-null values. Let’s take a closer look at the data that is missing:

# Count the missing values
print(df.isnull().sum())
## Status             0
## Brand              0
## Model              0
## CPU                0
## RAM                0
## Storage            0
## Storage type      42
## GPU             1371
## Screen             4
## Touch              0
## Final Price        0
## dtype: int64
# Visualise the missing data
msno.matrix(df)
plt.show()

Hmm, this tells us that a lot of the GPU data is missing. We might have to exclude this column from the analysis, or maybe a null value in this column is just a clumsy way of indicating that the laptop does not have a GPU? We will have to take a closer look at a later stage.
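
If the second interpretation turns out to be correct, one option would be to make the absence explicit rather than leaving it null. A minimal sketch, done on a copy so the outputs below are unaffected (the 'No GPU' label is an assumption that would need verifying against the source data):

# Treat a missing GPU as meaning 'no GPU' (an assumption to verify)
df_filled = df.copy()
df_filled['GPU'] = df_filled['GPU'].fillna('No GPU')
print(df_filled['GPU'].value_counts().head())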

3.3 Data Types

Check the data types:

print(df.dtypes)
## Status           object
## Brand            object
## Model            object
## CPU              object
## RAM               int64
## Storage           int64
## Storage type     object
## GPU              object
## Screen          float64
## Touch            object
## Final Price     float64
## dtype: object

Identify the categorical and numerical variables:

numerical = df.select_dtypes(include=['int64', 'float64']).columns
categorical = df.select_dtypes(include=['object']).columns
print('Numerical:', list(numerical))
## Numerical: ['RAM', 'Storage', 'Screen', 'Final Price']
print('Categorical:', list(categorical))
## Categorical: ['Status', 'Brand', 'Model', 'CPU', 'Storage type', 'GPU', 'Touch']
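
pandas also has a dedicated 'category' dtype, which is more memory-efficient than object and makes the intent explicit. A sketch, done on a copy so the object-dtype outputs later on this page are unchanged:

# Convert the categorical columns to pandas' 'category' dtype (on a copy)
df_cat = df.copy()
df_cat[categorical] = df_cat[categorical].astype('category')
print(df_cat.dtypes)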

3.4 Summary Statistics

Summary statistics of the numerical variables:

print(df.describe())
##                RAM      Storage       Screen  Final Price
## count  2160.000000  2160.000000  2156.000000  2160.000000
## mean     15.413889   596.294444    15.168112  1312.638509
## std       9.867815   361.220506     1.203329   911.475417
## min       4.000000     0.000000    10.100000   201.050000
## 25%       8.000000   256.000000    14.000000   661.082500
## 50%      16.000000   512.000000    15.600000  1031.945000
## 75%      16.000000  1000.000000    15.600000  1708.970000
## max     128.000000  4000.000000    18.000000  7150.470000
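
For Final Price the mean (1313) is well above the median (1032), which suggests a right-skewed distribution; skew() quantifies this:

# Quantify the asymmetry hinted at by mean > median
print(df[numerical].skew())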

Summary statistics of the categorical variables:

print(df.describe(include=['O']))
##        Status Brand Model            CPU Storage type       GPU Touch
## count    2160  2160  2160           2160         2118       789  2160
## unique      2    27   121             28            2        44     2
## top       New  Asus   15S  Intel Core i7          SSD  RTX 3050    No
## freq     1498   415   115            710         2062       129  1933

4 Univariate Analysis

Take a look at each variable individually.

4.1 Numerical Variables

# Plot a histogram (left) and a box plot (right) for each numerical variable
fig, axs = plt.subplots(len(numerical), 2, figsize=(12.8, 16))

for i, var in enumerate(numerical):
    ax = axs[i, 0]
    ax.set_xlabel(var)
    ax.set_ylabel('Frequency')
    df[var].hist(bins=20, ax=ax)

    ax = axs[i, 1]
    df.boxplot(column=var, ax=ax)

fig.tight_layout()
plt.show()
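
The Final Price histogram confirms the right skew suggested by the summary statistics. A log transform can make the shape easier to read; a sketch (NumPy is already available as a pandas dependency):

import numpy as np

# Re-plot the skewed price variable on a log10 scale
np.log10(df['Final Price']).hist(bins=20)
plt.xlabel('log10(Final Price)')
plt.ylabel('Frequency')
plt.show()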

4.2 Categorical Variables

Print the counts of each value that appears in the categorical columns:

for var in categorical:
    freq = df[var].value_counts()
    print(freq.head())
    print()
## Status
## New            1498
## Refurbished     662
## Name: count, dtype: int64
## 
## Brand
## Asus      415
## HP        368
## Lenovo    366
## MSI       308
## Acer      137
## Name: count, dtype: int64
## 
## Model
## 15S         115
## IdeaPad     104
## ROG         101
## VivoBook     99
## ThinkPad     99
## Name: count, dtype: int64
## 
## CPU
## Intel Core i7    710
## Intel Core i5    535
## AMD Ryzen 7      156
## Intel Core i3    130
## AMD Ryzen 5      127
## Name: count, dtype: int64
## 
## Storage type
## SSD     2062
## eMMC      56
## Name: count, dtype: int64
## 
## GPU
## RTX 3050    129
## RTX 3060    122
## RTX 3070     97
## RTX 4060     62
## RTX 3080     51
## Name: count, dtype: int64
## 
## Touch
## No     1933
## Yes     227
## Name: count, dtype: int64

Plot these counts:

fig, axs = plt.subplots(len(categorical), 2, figsize=(12.8, 32))

for i, var in enumerate(categorical):
    freq = df[var].value_counts()

    ax = axs[i, 0]
    freq.plot(kind='bar', ax=ax)

    ax = axs[i, 1]
    mosaic(freq.to_dict(), ax=ax)

fig.tight_layout()
plt.show()
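
Note that for high-cardinality columns such as Model (121 unique values) these plots become unreadable. One option is to plot only the most common values and lump the rest together; a sketch:

# Keep the 10 most common models and group the remainder as 'Other'
freq = df['Model'].value_counts()
top = freq.head(10).copy()
top.loc['Other'] = freq.iloc[10:].sum()
top.plot(kind='bar')
plt.ylabel('Count')
plt.show()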

5 Bivariate Analysis

Take a look at two variables at a time.

5.1 Numerical vs Numerical

# Scatter plot for each pair of numerical variables
fig, axs = plt.subplots(3, 2, figsize=(12.8, 12))
ax = axs[0, 0]
df.plot.scatter(x='RAM', y='Storage', ax=ax)
ax = axs[0, 1]
df.plot.scatter(x='RAM', y='Screen', ax=ax)
ax = axs[1, 0]
df.plot.scatter(x='RAM', y='Final Price', ax=ax)
ax = axs[1, 1]
df.plot.scatter(x='Storage', y='Screen', ax=ax)
ax = axs[2, 0]
df.plot.scatter(x='Storage', y='Final Price', ax=ax)
ax = axs[2, 1]
df.plot.scatter(x='Screen', y='Final Price', ax=ax)
fig.tight_layout()
plt.show()
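
seaborn can produce the same grid of pairwise comparisons (with histograms on the diagonal) in a single call:

# All pairwise scatter plots plus per-variable histograms in one call
sns.pairplot(df[numerical].dropna())
plt.show()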

Correlation heatmap:

corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
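
The default correlation method is Pearson, which measures linear association. A rank-based (Spearman) correlation is a useful cross-check when relationships may be monotonic but not linear:

# Rank-based correlation as a robustness check
print(df.corr(numeric_only=True, method='spearman'))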

5.2 Categorical vs Numerical

df.boxplot(column='Final Price', by='Brand', rot=90)
plt.show()
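
A quick numeric complement to the boxplot, for example the median price per brand:

# Median price per brand, highest first
print(df.groupby('Brand')['Final Price'].median().sort_values(ascending=False).head())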

5.3 Categorical vs Categorical

print(pd.crosstab(df['Status'], df['Touch']))
## Touch          No  Yes
## Status                
## New          1320  178
## Refurbished   613   49
mosaic(df, ['Status', 'Touch'])
plt.show()
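
To formally test whether Status and Touch are independent, a chi-squared test can be run on the contingency table. A sketch using SciPy (installed as a statsmodels dependency):

from scipy.stats import chi2_contingency

# Chi-squared test of independence on the Status x Touch table
chi2, p, dof, expected = chi2_contingency(pd.crosstab(df['Status'], df['Touch']))
print(f'chi2 = {chi2:.2f}, p = {p:.3f}')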

6 Outlier Detection

Outliers may or may not be relevant to an analysis, so how they get handled depends on factors that are usually specific to the analysis method being used. Regardless, identifying unusual values that might affect the analysis down the line is a useful step in the initial data exploration phase.

plt.boxplot(df['Final Price'])
plt.show()

q1 = df['Final Price'].quantile(0.25)
q3 = df['Final Price'].quantile(0.75)
iqr = q3 - q1
outliers = df[(df['Final Price'] < q1 - 1.5*iqr) | (df['Final Price'] > q3 + 1.5*iqr)]
print(outliers.head())
##     Status  Brand  Model            CPU  RAM  Storage Storage type       GPU  Screen Touch  Final Price
## 100    New  Razer  Blade  Intel Core i7   16     1000          SSD  RTX 3070    15.6    No      3299.99
## 292    New   Asus    ROG  Intel Core i9   32     1000          SSD  RTX 4090    16.0    No      3699.01
## 307    New   Asus    ROG  Intel Core i9   32     1000          SSD  RTX 4090    18.0    No      3699.01
## 351    New   Asus    ROG  Intel Core i9   32     1000          SSD  RTX 4080    16.0    No      3599.00
## 361    New   Asus    ROG  Intel Core i9   32     1000          SSD  RTX 4080    18.0    No      3399.00
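
The 1.5 × IQR rule used above is one convention; a z-score cutoff is another common one. A sketch using the conventional threshold of three standard deviations:

# Alternative: flag values more than 3 standard deviations from the mean
z = (df['Final Price'] - df['Final Price'].mean()) / df['Final Price'].std()
print((z.abs() > 3).sum(), 'outliers by the z-score rule')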

7 Handle Duplicates

Similar to outliers, the way in which duplicates get handled depends on the exact analysis method being used. The code below simply removes them:

print(df.duplicated().sum())
## 62
df = df.drop_duplicates()
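
Before removing them, you may want to inspect the duplicated rows; keep=False marks every member of each duplicate group (so run this before the drop_duplicates() call above):

# Show all rows involved in duplication (run before drop_duplicates())
print(df[df.duplicated(keep=False)].head())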

8 Summary

Exploratory Data Analysis (EDA) is a crucial first step in any data science workflow:

  • It helps in understanding the structure, quality and patterns that exist in the data before moving on to modelling
  • EDA includes data profiling, visualisation and statistical exploration
  • The above help to uncover missing values, outliers, relationships and distributions that can significantly impact results
  • The exploratory process can help to validate model assumptions, highlight potential biases and/or reveal new insights or features

Doing EDA at the start of a project makes it easier to have faith in the reliability of the downstream analysis. It also helps to focus attention on the research and statistical questions that, ultimately, need to be answered by the data.
