Data Cleaning

Loading a dataset and running .head() on it can make data look deceptively tidy. But spend a few more minutes exploring and you’ll almost always find problems — rows with missing values, columns stored as the wrong type, duplicates, and inconsistencies that would silently corrupt your analysis if you didn’t catch them.

Data cleaning is the process of identifying and fixing these problems before you do anything else. It’s unglamorous work, but it’s the difference between analysis you can trust and analysis that leads you to the wrong conclusions.

Finding missing values

Missing values in Pandas are represented as NaN — short for “Not a Number.” They show up when a cell in your dataset is empty or couldn’t be read.

You can find them with:

df.isnull()           # True/False for every cell
df.isnull().sum()     # count of missing values per column

.isnull().sum() is the one you’ll use most. It gives you a quick snapshot of where your missing data is:

name        0
age         3
score       1
city        7
dtype: int64

This tells you at a glance which columns have problems and how bad they are.

Handling missing values

Once you know where the missing values are, you have two options: drop them or fill them.

Dropping rows with missing values:

df.dropna()                  # drop any row with at least one missing value
df.dropna(subset=["score"])  # drop only rows where score is missing

Filling missing values:

df.fillna(0)                          # replace all NaN with 0
df["score"].fillna(df["score"].mean()) # replace with the column mean
df["city"].fillna("Unknown")          # replace with a placeholder

Which approach you choose depends on the situation. Dropping rows makes sense when only a small number are missing and losing them won’t skew your analysis. Filling makes sense when dropping would lose too much data, or when a sensible default value exists.

Fixing data types

When Pandas loads a CSV it makes its best guess about what type each column is. It doesn’t always get it right.

Check your data types with:

df.dtypes

A common problem is a column of numbers being read as text (called object in Pandas):

name      object
age       object   ← should be int
score     float64
joined    object   ← should be datetime

Fix it with .astype():

df["age"] = df["age"].astype(int)

For dates, use pd.to_datetime():

df["joined"] = pd.to_datetime(df["joined"])

Getting types right matters because Pandas won’t let you do math on a column it thinks is text, and date operations won’t work until the column is actually a datetime type.

Removing duplicates

Duplicate rows sneak into datasets more often than you’d expect — data exports, merges, and manual entry are all common sources.

Find them with:

df.duplicated().sum()  # how many duplicate rows exist

Remove them with:

df.drop_duplicates()

You can also check for duplicates based on specific columns only:

df.drop_duplicates(subset=["name", "score"])

This removes rows where both the name and score are identical, even if other columns differ.

Fixing inconsistent values

Text data is especially prone to inconsistency. The same value spelled different ways, extra spaces, mixed capitalisation — these all cause problems when you try to filter or group your data.

df["city"].unique()
# ['New York', 'new york', 'NEW YORK', ' New York']

These are all the same city, but Pandas treats them as four different values. Fix it by standardising:

df["city"] = df["city"].str.strip()    # remove leading/trailing spaces
df["city"] = df["city"].str.lower()    # make everything lowercase

Or combine them:

df["city"] = df["city"].str.strip().str.lower()

Now all four variants become "new york" and your data is consistent.

Renaming and dropping columns

Datasets often come with column names that are awkward to work with — spaces, inconsistent capitalisation, cryptic abbreviations.

Rename columns with:

df.rename(columns={"Std_Scr": "score", "Std_Nm": "name"})

Drop columns you don’t need with:

df.drop(columns=["irrelevant_column"])

Cleaning up column names early makes everything that follows easier to read and less error-prone.

Putting it all together

Here’s a realistic cleaning workflow from start to finish:

import pandas as pd

# Load
df = pd.read_csv("students.csv")

# Inspect
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
print(df.duplicated().sum())

# Fix types
df["age"] = df["age"].astype(int)
df["joined"] = pd.to_datetime(df["joined"])

# Handle missing values
df["score"] = df["score"].fillna(df["score"].mean())
df = df.dropna(subset=["name"])

# Remove duplicates
df = df.drop_duplicates()

# Standardise text
df["city"] = df["city"].str.strip().str.lower()

# Rename columns
df = df.rename(columns={"Std_ID": "student_id"})

# Confirm
print(df.isnull().sum())
print(df.shape)

The final .isnull().sum() and .shape are a sanity check — confirming the dataset looks the way you expect before moving on.

Clean data is the foundation everything else depends on. The exploratory analysis, the visualizations, the conclusions you draw — they’re only as trustworthy as the data underneath them. Get into the habit of cleaning thoroughly before you do anything else, and your results will be worth believing.