Logistic Regression

In the last unit you learned how linear regression finds a line of best fit to predict a number. The structure — features go in, a trained model comes out, .predict() returns something — should feel familiar by now.

This unit introduces your first classification algorithm: Logistic Regression. Despite the name, it doesn’t predict numbers. It predicts categories.

The problem

Suppose you want to predict whether a patient has diabetes based on their blood glucose level. Your target isn’t a number — it’s one of two outcomes: yes or no.

glucose | diabetes
--------|----------
85      | 0
140     | 1
172     | 1
95      | 0
210     | 1

You could try linear regression: draw a line through the data, threshold the output at 0.5, and call everything above it a 1. It sometimes works. But a straight line has no upper or lower bound, so for extreme input values it will predict something like 1.8 or −0.3, which aren’t meaningful probabilities.
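
To see the problem concretely, here is a minimal sketch that fits a straight line to the five rows in the table above. It uses the closed-form least-squares formula in plain Python so it runs without any libraries. The two extreme glucose readings (40 and 300) are made up for illustration.

```python
# Toy data from the table above: glucose level and diabetes label (0 or 1).
glucose = [85, 140, 172, 95, 210]
diabetes = [0, 1, 1, 0, 1]

# Closed-form least-squares fit of: diabetes ~ slope * glucose + intercept.
n = len(glucose)
mean_x = sum(glucose) / n
mean_y = sum(diabetes) / n
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, diabetes))
variance = sum((x - mean_x) ** 2 for x in glucose)
slope = covariance / variance
intercept = mean_y - slope * mean_x


def linear_prediction(x):
    """Raw output of the fitted line; nothing keeps it inside [0, 1]."""
    return slope * x + intercept


# Hypothetical extreme readings, chosen only to show the problem.
low_reading = linear_prediction(40)
high_reading = linear_prediction(300)
print(f"glucose=40  -> {low_reading:.2f}")
print(f"glucose=300 -> {high_reading:.2f}")
```

The first value comes out below 0 and the second above 1, so neither can be read as a probability.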

You need something that always outputs a number between 0 and 1.

The sigmoid function

Logistic regression solves this with one small change. Instead of outputting mx + b directly, it passes that value through a function called the sigmoid:

\sigma(z) = \frac{1}{1 + e^{-z}}

No matter what z is — large, small, negative — the sigmoid squashes it into the range (0, 1). The output can always be read as a probability.

Plot it and you get an S-shaped curve:

Inputs well below zero → output close to 0
Inputs well above zero → output close to 1
Input at zero → output exactly 0.5
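
The three cases above are easy to check yourself. This is a minimal sketch using only Python's standard `math` module. The name `sigmoid` and the sample inputs are choices made for this example. Note that this naive version will overflow for very large negative inputs (around −710 and below), so treat it as an illustration, not production code.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# A few sample inputs on either side of zero.
for z in (-6, -2, 0, 2, 6):
    print(f"z={z:>3} -> sigmoid(z)={sigmoid(z):.4f}")
```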

The sigmoid is also known as the logistic function, which is where the name “logistic” comes from. The regression part is still there: you’re still fitting a linear combination of features. The only difference is that the output is transformed into a probability.

From probability to prediction

The model outputs a probability, not a label. To get a prediction, you apply a threshold — usually 0.5:

if probability >= 0.5 → predict 1
if probability < 0.5  → predict 0

This threshold is a choice, not a law. In medical screening you might lower it to 0.3 — you’d rather flag some healthy patients than miss a sick one. You’ll revisit this when you study precision and recall.
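
Here is a small sketch of how the threshold changes the outcome. The helper `to_labels` and the five probabilities are made up for illustration. With a fitted scikit-learn model you would typically get the probabilities for class 1 from `model.predict_proba(X_test)[:, 1]`.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities into 0/1 labels using a cutoff."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up probabilities for five patients, purely for illustration.
probabilities = [0.10, 0.35, 0.50, 0.62, 0.91]

default_labels = to_labels(probabilities)          # threshold 0.5
screening_labels = to_labels(probabilities, 0.3)   # lower cutoff for screening

print(default_labels)    # [0, 0, 1, 1, 1]
print(screening_labels)  # [0, 1, 1, 1, 1]
```

Lowering the threshold to 0.3 flags the second patient as well, which is exactly the trade-off described above: more false alarms, fewer missed cases.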

Logistic regression in code

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
df = pd.read_csv("diabetes.csv")
X = df[["glucose"]]
y = df["diabetes"]

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, predictions):.2%}")

The structure is identical to every unit before it: load, split, fit, predict, evaluate. Only the imports changed.

Back to accuracy — with a caveat

You’re predicting categories again, so accuracy is back as your metric. But accuracy can be misleading for classification problems.

Imagine a dataset where 95% of patients don’t have diabetes. A model that always predicts 0 — never flags anyone — achieves 95% accuracy while being completely useless.

This is called the class imbalance problem. Accuracy looks fine on the surface, but the model has learned nothing.
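
You can reproduce that scenario in a few lines. This sketch uses made-up labels (95 healthy patients, 5 with diabetes) and a "model" that always predicts 0. Nothing here comes from a real dataset.

```python
# Made-up labels: 95 patients without diabetes (0) and 5 with it (1).
y_true = [0] * 95 + [1] * 5

# A "model" that never flags anyone.
y_pred = [0] * len(y_true)

# Fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Of the patients who actually have diabetes, how many did the model catch?
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
actual_positives = sum(y_true)

print(f"Accuracy: {accuracy:.0%}")
print(f"Sick patients caught: {caught} of {actual_positives}")
```

The accuracy is 95%, yet the model catches none of the five patients who actually have diabetes.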

One way to catch this early: check how often each class appears in your data before you train anything.

print(y.value_counts())

If one class heavily outnumbers the other, you’ll need better evaluation metrics — precision, recall, and F1 — which you’ll cover in a later unit.

Inspecting the model

Just like linear regression, you can look inside:

print(f"Coefficient: {model.coef_[0][0]:.4f}")
print(f"Intercept:   {model.intercept_[0]:.4f}")

The coefficient tells you the direction: a positive value means higher glucose pushes the probability of diabetes upward. Its magnitude is the change in the log-odds of diabetes for each one-unit increase in glucose. The effect on the probability itself isn’t linear anymore, because it follows that S-curve, but the sign and rough magnitude are still interpretable.
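
To make that concrete, here is a sketch that turns a coefficient and an intercept back into probabilities by hand. The values 0.05 and -6.0 are hypothetical stand-ins for `model.coef_[0][0]` and `model.intercept_[0]`, not results from a fitted model. They are chosen so the curve crosses 0.5 at a glucose level of 120.

```python
import math

# Hypothetical values standing in for model.coef_[0][0] and model.intercept_[0].
coefficient = 0.05
intercept = -6.0


def probability(glucose):
    """Sigmoid of the linear score: coefficient * glucose + intercept."""
    z = coefficient * glucose + intercept
    return 1.0 / (1.0 + math.exp(-z))


# The same +10 step in glucose moves the probability by different amounts.
step_near_middle = probability(130) - probability(120)
step_in_tail = probability(70) - probability(60)

# On the odds scale the effect is constant:
# each extra unit of glucose multiplies the odds by exp(coefficient).
odds_multiplier_per_unit = math.exp(coefficient)

print(f"+10 near the middle of the curve: {step_near_middle:+.3f}")
print(f"+10 out in the tail:              {step_in_tail:+.3f}")
print(f"Odds multiplier per unit:         {odds_multiplier_per_unit:.3f}")
```

Under these assumed values, a 10-point rise in glucose changes the probability far more near the middle of the S-curve than out in the tail, even though the coefficient is the same in both places.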

Multiple features

Logistic regression scales to multiple features the same way linear regression does:

X = df[["glucose", "bmi", "age"]]

Each feature gets its own coefficient. The model computes a weighted sum, passes it through the sigmoid, and returns a probability. The code doesn’t change.
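
As a rough sketch of what happens inside the model with several features, the snippet below computes the weighted sum and the sigmoid by hand. The coefficients, the intercept, and both patients are invented for illustration. A fitted model would learn its own values.

```python
import math

# Hypothetical coefficients and intercept, not taken from a fitted model.
coefficients = {"glucose": 0.04, "bmi": 0.08, "age": 0.02}
intercept = -9.0


def predict_probability(patient):
    """Weighted sum of the features plus the intercept, passed through the sigmoid."""
    z = intercept + sum(coefficients[name] * patient[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


# Two made-up patients.
patient_a = {"glucose": 90, "bmi": 22, "age": 30}
patient_b = {"glucose": 180, "bmi": 35, "age": 60}

prob_a = predict_probability(patient_a)
prob_b = predict_probability(patient_b)
print(f"Patient A: {prob_a:.2f}")
print(f"Patient B: {prob_b:.2f}")
```

With these assumed weights, patient A lands well below 0.5 and patient B well above it, so a 0.5 threshold would label them 0 and 1 respectively.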

Linear vs Logistic: what actually changed

           | Linear Regression | Logistic Regression
-----------|-------------------|------------------------
Output     | A number          | A probability (0–1)
Final step | None              | Sigmoid + threshold
Task       | Regression        | Classification
Metric     | MAE               | Accuracy (with caveats)

The underlying mechanics are nearly identical. That’s intentional — logistic regression is the natural bridge from regression to classification, and understanding both prepares you for algorithms where the internals are far less transparent.

What just happened?

Logistic regression takes a linear combination of features and passes it through a sigmoid function to produce a probability. A threshold converts that probability into a class prediction.

It’s fast, interpretable, and a strong baseline for binary classification. As with linear regression, if a more complex model can’t outperform it, that’s a signal worth investigating.

In the next unit you’ll meet Decision Trees properly — now with the vocabulary to understand what they’re actually doing.