Loading a dataset and running .head() on it can make data look deceptively tidy. But spend a few more minutes exploring and you’ll almost always find problems — rows with missing values, columns stored as the wrong type, duplicates, and inconsistencies that would silently corrupt your analysis if you didn’t catch them.
Data cleaning is the process of identifying and fixing these problems before you do anything else. It’s unglamorous work, but it’s the difference between analysis you can trust and analysis that leads you to the wrong conclusions.
Missing values in Pandas are represented as NaN — short for “Not a Number.” They show up when a cell in your dataset is empty or couldn’t be read.
You can find them with:
df.isnull() # True/False for every cell
df.isnull().sum() # count of missing values per column
.isnull().sum() is the one you’ll use most. It gives you a quick snapshot of where your missing data is:
name 0
age 3
score 1
city 7
dtype: int64
This tells you at a glance which columns have problems and how bad they are.
Once you know where the missing values are, you have two options: drop them or fill them.
df.dropna() # drop any row with at least one missing value
df.dropna(subset=["score"]) # drop only rows where score is missing
df.fillna(0) # replace all NaN with 0
df["score"].fillna(df["score"].mean()) # replace with the column mean
df["city"].fillna("Unknown") # replace with a placeholder
Which approach you choose depends on the situation. Dropping rows makes sense when only a small number are missing and losing them won’t skew your analysis. Filling makes sense when dropping would lose too much data, or when a sensible default value exists.
When Pandas loads a CSV it makes its best guess about what type each column is. It doesn’t always get it right.
Check your data types with:
df.dtypes
A common problem is a column of numbers being read as text (called object in Pandas):
name object
age object ← should be int
score float64
joined object ← should be datetime
Fix it with .astype():
df["age"] = df["age"].astype(int)
For dates, use pd.to_datetime():
df["joined"] = pd.to_datetime(df["joined"])
Getting types right matters because Pandas won’t let you do math on a column it thinks is text, and date operations won’t work until the column is actually a datetime type.
Duplicate rows sneak into datasets more often than you’d expect — data exports, merges, and manual entry are all common sources.
Find them with:
df.duplicated().sum() # how many duplicate rows exist
Remove them with:
df.drop_duplicates()
You can also check for duplicates based on specific columns only:
df.drop_duplicates(subset=["name", "score"])
This removes rows where both the name and score are identical, even if other columns differ.
Text data is especially prone to inconsistency. The same value spelled different ways, extra spaces, mixed capitalisation — these all cause problems when you try to filter or group your data.
df["city"].unique()
# ['New York', 'new york', 'NEW YORK', ' New York']
These are all the same city, but Pandas treats them as four different values. Fix it by standardising:
df["city"] = df["city"].str.strip() # remove leading/trailing spaces
df["city"] = df["city"].str.lower() # make everything lowercase
Or combine them:
df["city"] = df["city"].str.strip().str.lower()
Now all four variants become "new york" and your data is consistent.
Datasets often come with column names that are awkward to work with — spaces, inconsistent capitalisation, cryptic abbreviations.
Rename columns with:
df.rename(columns={"Std_Scr": "score", "Std_Nm": "name"})
Drop columns you don’t need with:
df.drop(columns=["irrelevant_column"])
Cleaning up column names early makes everything that follows easier to read and less error-prone.
Here’s a realistic cleaning workflow from start to finish:
import pandas as pd
# Load
df = pd.read_csv("students.csv")
# Inspect
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
print(df.duplicated().sum())
# Fix types
df["age"] = df["age"].astype(int)
df["joined"] = pd.to_datetime(df["joined"])
# Handle missing values
df["score"] = df["score"].fillna(df["score"].mean())
df = df.dropna(subset=["name"])
# Remove duplicates
df = df.drop_duplicates()
# Standardise text
df["city"] = df["city"].str.strip().str.lower()
# Rename columns
df = df.rename(columns={"Std_ID": "student_id"})
# Confirm
print(df.isnull().sum())
print(df.shape)
The final .isnull().sum() and .shape are a sanity check — confirming the dataset looks the way you expect before moving on.
Clean data is the foundation everything else depends on. The exploratory analysis, the visualizations, the conclusions you draw — they’re only as trustworthy as the data underneath them. Get into the habit of cleaning thoroughly before you do anything else, and your results will be worth believing.