When given a new dataset, the first step should always be to perform Exploratory Data Analysis (EDA) - taking a broad, high-level look at what the dataset contains. This is done in order to get to grips with the data and its format and will include data visualisation to give a quick overview of various aspects of the data at a glance.
The code on this page uses the kagglehub, Matplotlib, statsmodels, missingno, pandas and seaborn libraries:
from kagglehub import KaggleDatasetAdapter
from matplotlib import pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
import kagglehub
import missingno as msno
import pandas as pd
import seaborn as sns
If you do not have one or more of these libraries installed, you can install them from the terminal:
$ python3.13 -m pip install missingno
$ python3.13 -m pip install kagglehub
$ python3.13 -m pip install statsmodels
$ python3.13 -m pip install matplotlib
$ python3.13 -m pip install pandas
$ python3.13 -m pip install seaborn
This page will use laptops.csv from Kaggle:
# Load the data from
# https://www.kaggle.com/datasets/juanmerinobermejo/laptops-price-dataset
df = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    'juanmerinobermejo/laptops-price-dataset',
    'laptops.csv',
)
df = df.reset_index()
df = df.drop('Laptop', axis=1)
This dataset contains information about laptops: their features and their prices. Don’t worry if you’ve never seen this dataset before or don’t know anything about laptops, the entire point of this exercise is to take an unknown dataset and explore it!
This step aims to understand the structure of the data:
Start by having a look at the first 5 rows (this might require you to change the pandas display options so that the output doesn’t get truncated):
# Increase the maximum number of columns that get printed
pd.set_option('display.max_columns', 20)
# Increase the width of the printed output
pd.set_option('display.width', 200)
# Display the first 5 rows
print(df.head())
## index Status Brand Model CPU RAM Storage Storage type GPU Screen Touch Final Price
## 0 0 New Asus ExpertBook Intel Core i5 8 512 SSD NaN 15.6 No 1009.00
## 1 1 New Alurin Go Intel Celeron 8 256 SSD NaN 15.6 No 299.00
## 2 2 New Asus ExpertBook Intel Core i3 8 256 SSD NaN 15.6 No 789.00
## 3 3 New MSI Katana Intel Core i7 16 1000 SSD RTX 3050 15.6 No 1199.00
## 4 4 New HP 15S Intel Core i5 16 512 SSD NaN 15.6 No 669.01
Clearly the index column is just an incrementing ID number, so we can delete it:
df = df.drop('index', axis=1)
Check the number of rows and columns:
print(df.shape)
## (2160, 11)
Check the column names and data types:
print(df.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2160 entries, 0 to 2159
## Data columns (total 11 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Status 2160 non-null object
## 1 Brand 2160 non-null object
## 2 Model 2160 non-null object
## 3 CPU 2160 non-null object
## 4 RAM 2160 non-null int64
## 5 Storage 2160 non-null int64
## 6 Storage type 2118 non-null object
## 7 GPU 789 non-null object
## 8 Screen 2156 non-null float64
## 9 Touch 2160 non-null object
## 10 Final Price 2160 non-null float64
## dtypes: float64(2), int64(2), object(7)
## memory usage: 185.8+ KB
## None
The above table tells us that most columns - but not all - have 2160 non-null values. Let’s take a closer look at the data that is missing:
# Count the missing values
print(df.isnull().sum())
## Status 0
## Brand 0
## Model 0
## CPU 0
## RAM 0
## Storage 0
## Storage type 42
## GPU 1371
## Screen 4
## Touch 0
## Final Price 0
## dtype: int64
# Visualise the missing data
msno.matrix(df)
plt.show()
Hmm, this tells us that a lot of the GPU data is missing. We might have to exclude this column from an analysis, or perhaps a null value in this column is just a clumsy way of indicating that the laptop does not have a dedicated GPU? We will have to take a closer look at a later stage.
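If a closer look does confirm that a null here really means “no dedicated GPU”, one option is to make that explicit with a sentinel category rather than leaving it as a missing value. A sketch on a toy frame (the assumption that nulls mean “no GPU” would need verifying against the data source first):

```python
import pandas as pd

# Toy frame for illustration (not the real dataset)
df = pd.DataFrame({'GPU': ['RTX 3050', None, None, 'RTX 4060']})

# Replace missing GPU entries with an explicit category.
# This assumes a null really means "no dedicated GPU".
df['GPU'] = df['GPU'].fillna('None')

print(df['GPU'].value_counts())
```

This keeps the column usable in group-bys and plots, where NaN rows would otherwise be silently dropped.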
Check the data types:
print(df.dtypes)
## Status object
## Brand object
## Model object
## CPU object
## RAM int64
## Storage int64
## Storage type object
## GPU object
## Screen float64
## Touch object
## Final Price float64
## dtype: object
Identify the categorical and numerical variables:
numerical = df.select_dtypes(include=['int64', 'float64']).columns
categorical = df.select_dtypes(include=['object']).columns
print('Numerical:', list(numerical))
## Numerical: ['RAM', 'Storage', 'Screen', 'Final Price']
print('Categorical:', list(categorical))
## Categorical: ['Status', 'Brand', 'Model', 'CPU', 'Storage type', 'GPU', 'Touch']
Summary statistics of the numerical variables:
print(df.describe())
## RAM Storage Screen Final Price
## count 2160.000000 2160.000000 2156.000000 2160.000000
## mean 15.413889 596.294444 15.168112 1312.638509
## std 9.867815 361.220506 1.203329 911.475417
## min 4.000000 0.000000 10.100000 201.050000
## 25% 8.000000 256.000000 14.000000 661.082500
## 50% 16.000000 512.000000 15.600000 1031.945000
## 75% 16.000000 1000.000000 15.600000 1708.970000
## max 128.000000 4000.000000 18.000000 7150.470000
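One thing the table above hints at is skew: the mean of Final Price (≈1313) sits well above its median (≈1032), which suggests a long right tail. `Series.skew()` quantifies this; a sketch on toy prices (made-up values, not the real dataset):

```python
import pandas as pd

# Toy right-skewed prices for illustration
prices = pd.Series([300, 500, 700, 900, 1100, 1500, 3500, 7000])

# A positive skew statistic confirms a long right tail (mean > median)
print('mean:  ', prices.mean())
print('median:', prices.median())
print('skew:  ', round(prices.skew(), 2))
```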
Summary statistics of the categorical variables:
print(df.describe(include=['O']))
## Status Brand Model CPU Storage type GPU Touch
## count 2160 2160 2160 2160 2118 789 2160
## unique 2 27 121 28 2 44 2
## top New Asus 15S Intel Core i7 SSD RTX 3050 No
## freq 1498 415 115 710 2062 129 1933
Take a look at each variable individually:
fig, axs = plt.subplots(len(numerical), 2, figsize=(12.8, 16))
for i, var in enumerate(numerical):
    ax = axs[i, 0]
    ax.set_xlabel(var)
    ax.set_ylabel('Frequency')
    df[var].hist(bins=20, ax=ax)
    ax = axs[i, 1]
    df.boxplot(column=var, ax=ax)
fig.tight_layout()
plt.show()
Print the counts of each value that appears in the categorical columns:
for var in categorical:
    freq = df[var].value_counts()
    print(freq.head())
    print()
## Status
## New 1498
## Refurbished 662
## Name: count, dtype: int64
##
## Brand
## Asus 415
## HP 368
## Lenovo 366
## MSI 308
## Acer 137
## Name: count, dtype: int64
##
## Model
## 15S 115
## IdeaPad 104
## ROG 101
## VivoBook 99
## ThinkPad 99
## Name: count, dtype: int64
##
## CPU
## Intel Core i7 710
## Intel Core i5 535
## AMD Ryzen 7 156
## Intel Core i3 130
## AMD Ryzen 5 127
## Name: count, dtype: int64
##
## Storage type
## SSD 2062
## eMMC 56
## Name: count, dtype: int64
##
## GPU
## RTX 3050 129
## RTX 3060 122
## RTX 3070 97
## RTX 4060 62
## RTX 3080 51
## Name: count, dtype: int64
##
## Touch
## No 1933
## Yes 227
## Name: count, dtype: int64
Plot these counts:
fig, axs = plt.subplots(len(categorical), 2, figsize=(12.8, 32))
for i, var in enumerate(categorical):
    freq = df[var].value_counts()
    ax = axs[i, 0]
    freq.plot(kind='bar', ax=ax)
    ax = axs[i, 1]
    mosaic(freq.to_dict(), ax=ax)
fig.tight_layout()
plt.show()
Take a look at two variables at a time:
fig, axs = plt.subplots(3, 2, figsize=(12.8, 12))
ax = axs[0, 0]
df.plot.scatter(x='RAM', y='Storage', ax=ax)
ax = axs[0, 1]
df.plot.scatter(x='RAM', y='Screen', ax=ax)
ax = axs[1, 0]
df.plot.scatter(x='RAM', y='Final Price', ax=ax)
ax = axs[1, 1]
df.plot.scatter(x='Storage', y='Screen', ax=ax)
ax = axs[2, 0]
df.plot.scatter(x='Storage', y='Final Price', ax=ax)
ax = axs[2, 1]
df.plot.scatter(x='Screen', y='Final Price', ax=ax)
fig.tight_layout()
plt.show()
Correlation heatmap:
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
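To read the strongest relationships off the heatmap programmatically, the correlation matrix can be flattened and ranked. A sketch on a toy frame (the column names mirror the dataset but the values are made up):

```python
import numpy as np
import pandas as pd

# Toy numeric frame for illustration (not the real dataset)
df = pd.DataFrame({
    'RAM': [8, 16, 16, 32, 8],
    'Storage': [256, 512, 1000, 1000, 256],
    'Final Price': [500, 1000, 1400, 2500, 450],
})

corr = df.corr(numeric_only=True)

# Blank out the upper triangle (and diagonal) so each pair appears once,
# then rank the remaining pairs by absolute correlation strength
mask = np.triu(np.ones(corr.shape, dtype=bool))
pairs = corr.mask(mask).stack().sort_values(key=abs, ascending=False)
print(pairs)
```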
Box plot of the final price broken down by brand:
df.boxplot(column='Final Price', by='Brand', rot=90)
plt.show()
Cross-tabulate two categorical variables:
print(pd.crosstab(df['Status'], df['Touch']))
## Touch No Yes
## Status
## New 1320 178
## Refurbished 613 49
The same relationship as a mosaic plot:
mosaic(df, ['Status', 'Touch'])
plt.show()
Outliers may or may not be relevant in an analysis and so the way in which they get handled depends on a number of factors that are usually specific to the analysis method being used. Regardless, identifying unusual values that might affect the analysis down the line is a useful step in the initial data exploration phase.
plt.boxplot(df['Final Price'])
plt.show()
q1 = df['Final Price'].quantile(0.25)
q3 = df['Final Price'].quantile(0.75)
iqr = q3 - q1
outliers = df[(df['Final Price'] < q1 - 1.5*iqr) | (df['Final Price'] > q3 + 1.5*iqr)]
print(outliers.head())
## Status Brand Model CPU RAM Storage Storage type GPU Screen Touch Final Price
## 100 New Razer Blade Intel Core i7 16 1000 SSD RTX 3070 15.6 No 3299.99
## 292 New Asus ROG Intel Core i9 32 1000 SSD RTX 4090 16.0 No 3699.01
## 307 New Asus ROG Intel Core i9 32 1000 SSD RTX 4090 18.0 No 3699.01
## 351 New Asus ROG Intel Core i9 32 1000 SSD RTX 4080 16.0 No 3599.00
## 361 New Asus ROG Intel Core i9 32 1000 SSD RTX 4080 18.0 No 3399.00
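The same IQR rule can be wrapped in a small helper so it can be reused on any numerical column. A sketch (the 1.5 multiplier is the conventional Tukey fence, not the only sensible choice):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask flagging values outside the Tukey fences."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy data for illustration: only the 5000 lies outside the fences
prices = pd.Series([500, 600, 700, 800, 900, 5000])
print(prices[iqr_outliers(prices)])
```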
As with outliers, the way in which duplicates get handled depends on the exact analysis method being used. The code below simply removes them:
print(df.duplicated().sum())
## 62
df = df.drop_duplicates()
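A quick sanity check confirms the drop worked as expected; sketched here on a toy frame with one duplicate row:

```python
import pandas as pd

# Toy frame for illustration (not the real dataset)
df = pd.DataFrame({'Brand': ['Asus', 'HP', 'Asus'], 'RAM': [8, 16, 8]})

before = len(df)
df = df.drop_duplicates()
print(f'Removed {before - len(df)} duplicate rows; {df.duplicated().sum()} remain')
```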
Exploratory Data Analysis (EDA) is a crucial first step in any data science workflow:
Performing EDA at the start of a project makes it easier to trust the reliability of the downstream analysis. It also helps to focus on the research and statistical questions that, ultimately, the data needs to answer.