Pandas - INCS

NumPy arrays are great for numerical operations, but real-world data rarely comes as a clean grid of numbers. A dataset might have column names, text, dates, missing values, and dozens of different types of information all mixed together.

Pandas was built for exactly this. It gives you a structure called a DataFrame — think of it as a spreadsheet inside your Python code. Each column has a name, each row has an index, and you can load, explore, filter, and manipulate your data with clean, readable code.

If NumPy is the engine, Pandas is the cockpit. It’s where you’ll spend most of your time as a data scientist.

The DataFrame

A DataFrame is a table. It has rows and columns, just like a spreadsheet.

You can create one manually from a dictionary:

import pandas as pd

data = {
    "name": ["Alice", "Bob", "Carol"],
    "score": [85, 72, 90],
    "passed": [True, True, True]
}

df = pd.DataFrame(data)
print(df)

Output:

    name  score  passed
0  Alice     85    True
1    Bob     72    True
2  Carol     90    True

Each key in the dictionary becomes a column. Each value is a list of entries for that column. The numbers on the left (0, 1, 2) are the index — Pandas assigns these automatically.

The other key structure in Pandas is a Series — a single column on its own. A DataFrame is essentially a collection of Series sitting side by side.
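
To make the relationship between the two structures concrete, here is a minimal sketch. The `bonus` Series and its values are made up for illustration and are not part of the lesson's dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [85, 72, 90],
})

# Pulling one column out of a DataFrame gives back a Series
scores = df["score"]
print(type(scores).__name__)  # Series
print(scores.name)            # score

# The Series keeps the same row index as the DataFrame it came from
print(scores.index.equals(df.index))  # True

# A Series can also be built on its own (hypothetical example data)
bonus = pd.Series([5, 0, 2], name="bonus")
print(bonus)
```

Notice that the Series carries its column name and its index with it, which is what lets Pandas line the columns up side by side.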

Loading data from a CSV

You’ll rarely create DataFrames by hand. Most of the time you’ll load data from a file. The most common format is CSV (comma-separated values), and Pandas makes this one line:

df = pd.read_csv("students.csv")

That’s it. Pandas reads the file, uses the first row as column names, and gives you a DataFrame ready to work with.
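
If you don't have a students.csv on disk yet, you can still try this out. The sketch below assumes some made-up file contents and feeds them to read_csv through io.StringIO, which behaves like an open file. The data is invented to match the earlier example.

```python
import io
import pandas as pd

# Hypothetical file contents; a real students.csv would live on disk
csv_text = """name,score,passed
Alice,85,True
Bob,72,True
Carol,90,True
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())  # ['name', 'score', 'passed']
print(df.shape)             # (3, 3)
```

The header row became the column names, and the three remaining lines became the rows.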

Exploring a dataset

The first thing you do with any new dataset is get a feel for what’s in it. Pandas gives you several tools for this:

df.head()     # first 5 rows
df.tail()     # last 5 rows
df.shape      # (rows, columns) — e.g. (150, 5)
df.columns    # column names (returned as an Index object)
df.info()     # column names, data types, non-null counts per column
df.describe() # summary statistics for numerical columns

Get into the habit of running these every time you load a new dataset. They’ll tell you immediately if something looks wrong — unexpected data types, missing values, columns that don’t make sense.
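
Here is a small sketch of what that first look can reveal. The four-row dataset is made up, with one score deliberately left missing. It also uses df.isna().sum(), which is not covered above but is a common way to count missing values directly.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dan"],
    "score": [85, None, 90, 64],  # one missing value on purpose
})

print(df.shape)   # (4, 2)
print(df.dtypes)  # score shows up as float64 because of the missing value

# Count missing values in each column
missing = df.isna().sum()
print(missing)

# describe() skips missing values, so its count for score is 3, not 4
summary = df.describe()
print(summary)
```

A float column where you expected whole numbers, or a count lower than the number of rows, is often the first hint that some values are missing.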

Selecting columns and rows

Selecting a single column returns a Series:

df["score"]

Selecting multiple columns returns a DataFrame:

df[["name", "score"]]

Selecting rows by position uses .iloc[] (integer location):

df.iloc[0] # first row
df.iloc[0:3] # first three rows

Selecting rows by label uses .loc[]:

df.loc[0] # row with index 0
df.loc[0:2] # rows with index 0 through 2

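.iloc thinks in positions. .loc thinks in labels. With the default index (0, 1, 2, ...) the two often look the same, but slices behave differently: .iloc[0:3] stops before position 3, while .loc[0:2] includes the row labelled 2. For most beginners .iloc is more intuitive at first, but both are worth knowing.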
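
The difference is easier to see when the labels don't match the positions. This sketch assumes a made-up DataFrame whose index is 10, 20, 30, 40 instead of the default.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol", "Dan"], "score": [85, 72, 90, 64]},
    index=[10, 20, 30, 40],  # labels chosen to differ from positions
)

first_by_position = df.iloc[0]  # row at position 0
first_by_label = df.loc[10]     # row whose label is 10
print(first_by_position["name"], first_by_label["name"])  # Alice Alice

pos_slice = df.iloc[0:2]     # positions 0 and 1; the end is excluded
label_slice = df.loc[10:30]  # labels 10 through 30; the end is included

print(len(pos_slice))    # 2
print(len(label_slice))  # 3
```

With this index, df.loc[0] would raise a KeyError, because no row is labelled 0.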

Filtering rows

Remember boolean masking from the NumPy unit? The exact same idea applies in Pandas:

df[df["score"] > 75]

This returns only the rows where the score column is greater than 75. You can combine conditions the same way, using & for "and" and | for "or". Each condition needs its own parentheses:

df[(df["score"] > 75) & (df["passed"] == True)]

Filtering is one of the most frequent things you’ll do with a dataset — isolating the rows that meet a certain condition before doing further analysis.
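
To see what is happening under the hood, it can help to look at the mask itself. The sketch below uses a made-up four-student dataset. The ~ operator ("not") is an addition that the text above does not cover.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dan"],
    "score": [85, 72, 90, 64],
    "passed": [True, True, True, False],
})

# The comparison produces one True/False value per row
mask = df["score"] > 75
print(mask.tolist())  # [True, False, True, False]

# Indexing with the mask keeps only the True rows
high = df[mask]

# A boolean column can be used as a condition directly
high_and_passed = df[(df["score"] > 75) & df["passed"]]

# | means "or", ~ means "not"
low_or_failed = df[(df["score"] < 70) | ~df["passed"]]

print(high["name"].tolist())           # ['Alice', 'Carol']
print(low_or_failed["name"].tolist())  # ['Dan']
```

The original df is left untouched. Each filter returns a new DataFrame containing only the matching rows.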

Common column operations

You can do math on an entire column at once, just like NumPy arrays:

df["score"] + 5        # add 5 to every score
df["score"] * 1.1      # increase every score by 10%

You can create a new column from existing ones:

df["score_out_of_50"] = df["score"] / 2

You can also use aggregate functions on a column:

df["score"].mean()
df["score"].max()
df["score"].min()
df["score"].sum()

And you can sort the DataFrame by a column:

df.sort_values("score")               # ascending
df.sort_values("score", ascending=False)  # descending
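
One thing worth checking for yourself: column math and sort_values return new objects rather than changing df. The following sketch, using the same made-up three-student data as before, shows that only assignment stores a result. The `curved` column name is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Carol"], "score": [85, 72, 90]})

curved = df["score"] + 5  # a new Series; df itself is unchanged
df["curved"] = curved     # assigning it is what stores it as a column

average = df["score"].mean()
print(average)  # mean of 85, 72 and 90

ranked = df.sort_values("score", ascending=False)
print(ranked["name"].tolist())  # ['Carol', 'Alice', 'Bob']
print(df["name"].tolist())      # ['Alice', 'Bob', 'Carol']  (original order kept)
```

If you want to keep a sorted or modified version, assign it to a variable, as done with ranked here.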

Putting it all together

Here’s a small end-to-end example using a student dataset:

import pandas as pd

# Load the data
df = pd.read_csv("students.csv")

# Get a feel for the data
print(df.shape)
df.info()  # prints its own report and returns None, so no print() needed
print(df.describe())

# See the top 5 rows
print(df.head())

# Filter to passing students only
passing = df[df["score"] >= 60]
print("Passing students:", len(passing))

# Add a grade column (score as a fraction of 100)
df["grade"] = df["score"] / 100

# Find the top and average scores
print("Top score:", df["score"].max())
print("Average score:", df["score"].mean())

# Sort by score descending
df_sorted = df.sort_values("score", ascending=False)
print(df_sorted.head())

Pandas is the tool you’ll reach for more than anything else in this course. Everything from here — cleaning data, exploratory analysis, visualization — runs through it. The concepts you practiced here will feel like second nature by the time the module is done.