In the last unit you learned how linear regression finds a line of best fit to predict a number. The structure — features go in, a trained model comes out, .predict() returns something — should feel familiar by now.
This unit introduces your first classification algorithm: Logistic Regression. Despite the name, it doesn’t predict numbers. It predicts categories.
Suppose you want to predict whether a patient has diabetes based on their blood glucose level. Your target isn’t a number — it’s one of two outcomes: yes or no.
glucose | diabetes
--------|----------
85 | 0
140 | 1
172 | 1
95 | 0
210 | 1
You could try linear regression: draw a line through the data, threshold the output at 0.5, and call everything above it a 1. It sometimes works. But a line’s output has no upper or lower bound. For extreme input values it will predict something like 1.8 or −0.3, which aren’t meaningful probabilities.
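To make the problem concrete, here is a minimal sketch that fits an ordinary least-squares line to the five rows in the table above and then asks it about two extreme glucose values. The inputs 40 and 300 are assumptions chosen for illustration and are not rows in the dataset. The helper name `linear_prediction` is also made up for this example.

```python
# Fit a straight line to the five example rows by ordinary least squares,
# then ask it for predictions at extreme glucose values.
glucose = [85, 140, 172, 95, 210]
diabetes = [0, 1, 1, 0, 1]

n = len(glucose)
mean_x = sum(glucose) / n
mean_y = sum(diabetes) / n

# Closed-form slope and intercept for a single feature
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, diabetes))
variance = sum((x - mean_x) ** 2 for x in glucose)
slope = covariance / variance
intercept = mean_y - slope * mean_x


def linear_prediction(x):
    """Raw linear regression output, with no bounds applied."""
    return slope * x + intercept


# 40 and 300 are illustrative inputs, not rows from the dataset
low = linear_prediction(40)
high = linear_prediction(300)
print(f"glucose=40  -> {low:.2f}")
print(f"glucose=300 -> {high:.2f}")
```

On this tiny dataset the low input comes out below 0 and the high input comes out above 1, so neither can be read as a probability.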
You need something that always outputs a number between 0 and 1.
Logistic regression solves this with one small change. Instead of outputting mx + b directly, it passes that value, call it z, through a function called the sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = mx + b$$
No matter what z is — large, small, negative — the sigmoid squashes it into the range (0, 1). The output can always be read as a probability.
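A short sketch of the sigmoid in plain Python shows the squashing in action. The function name `sigmoid` and the sample z values are chosen for illustration. The two-branch form is one common way to avoid overflow for very negative inputs, and it is not necessarily how any particular library implements it.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: maps any real z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z)
    # so that math.exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)


for z in [-10, -2, 0, 2, 10]:
    print(f"z={z:>4} -> {sigmoid(z):.4f}")
```

The output at z = 0 is exactly 0.5. Values far to the left approach 0 and values far to the right approach 1.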
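Plot it and you get an S-shaped curve: flat near 0 for large negative z, rising through 0.5 at z = 0, and flat near 1 for large positive z.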
The sigmoid is also known as the logistic function, which is where the name “logistic” comes from. The regression part is still there: you’re still fitting a linear combination of features, but the output is transformed into a probability.
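One way to see that the regression part survives is to undo the sigmoid. Assuming the standard sigmoid form, with p for the predicted probability and z = mx + b for the linear part, solving for z gives:

$$
p = \frac{1}{1 + e^{-z}}
\;\Longrightarrow\;
\frac{p}{1 - p} = e^{z}
\;\Longrightarrow\;
\ln\frac{p}{1 - p} = z = mx + b
$$

The quantity on the left is the log-odds. The model is a straight line in log-odds, even though it is an S-curve in probability.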
The model outputs a probability, not a label. To get a prediction, you apply a threshold — usually 0.5:
if probability >= 0.5 → predict 1
if probability < 0.5 → predict 0
This threshold is a choice, not a law. In medical screening you might lower it to 0.3 — you’d rather flag some healthy patients than miss a sick one. You’ll revisit this when you study precision and recall.
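The effect of moving the threshold can be sketched in a few lines. The probabilities below are made up for illustration, and `apply_threshold` is a hypothetical helper rather than a library function.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels at the given cutoff."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical probabilities for five patients (made up for illustration)
probabilities = [0.10, 0.35, 0.48, 0.62, 0.91]

default_labels = apply_threshold(probabilities)         # cutoff 0.5
screening_labels = apply_threshold(probabilities, 0.3)  # cutoff 0.3

print("threshold 0.5:", default_labels)
print("threshold 0.3:", screening_labels)
```

Lowering the cutoff from 0.5 to 0.3 flags two more patients here. With scikit-learn, the probabilities would typically come from `model.predict_proba(X_test)[:, 1]`, while `model.predict` behaves like a 0.5 cutoff for binary problems.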
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
df = pd.read_csv("diabetes.csv")
X = df[["glucose"]]
y = df["diabetes"]
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict
predictions = model.predict(X_test)
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, predictions):.2%}")
The structure is identical to every unit before it. The model class changed, and the metric changed to match the task. Nothing else did.
You’re predicting categories again, so accuracy is back as your metric. But accuracy can be misleading for classification problems.
Imagine a dataset where 95% of patients don’t have diabetes. A model that always predicts 0 — never flags anyone — achieves 95% accuracy while being completely useless.
This is called the class imbalance problem. Accuracy looks fine on the surface, but the model has learned nothing.
One way to catch this early: check how often each class appears in your data before you train anything.
print(y.value_counts())
If one class heavily outnumbers the other, you’ll need better evaluation metrics — precision, recall, and F1 — which you’ll cover in a later unit.
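The 95% scenario is easy to reproduce without training anything. This sketch uses 100 made-up labels and a "model" that always predicts 0. It also computes recall by hand, using the usual definition (the share of actual positives that were caught), as a preview of why accuracy alone is not enough.

```python
# 100 made-up patients: 95 without diabetes (0), 5 with it (1)
y_true = [0] * 95 + [1] * 5

# A "model" that never flags anyone
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: of the patients who really have diabetes, how many were caught?
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"Accuracy: {accuracy:.0%}")
print(f"Recall:   {recall:.0%}")
```

Accuracy comes out at 95% while recall is 0%: every patient with diabetes was missed.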
Just like linear regression, you can look inside:
print(f"Coefficient: {model.coef_[0][0]:.4f}")
print(f"Intercept: {model.intercept_[0]:.4f}")
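The coefficient’s sign tells you the direction: a positive value means higher glucose pushes the probability of diabetes upward. Its magnitude is the change in log-odds for a one-unit increase in glucose, so the effect on the probability itself is not constant. It follows that S-curve, largest near 0.5 and smaller toward the extremes. The sign and rough magnitude are still interpretable.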
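A small sketch can make that interpretation concrete. The coefficient and intercept below are hypothetical values chosen for illustration and are not fitted to any data. In practice you would substitute `model.coef_[0][0]` and `model.intercept_[0]` from your own fit.

```python
import math

# Hypothetical fitted values, chosen for illustration only;
# replace with model.coef_[0][0] and model.intercept_[0] from a real fit.
coefficient = 0.05
intercept = -6.0


def probability(glucose):
    """Sigmoid of the linear score for a single glucose value."""
    z = coefficient * glucose + intercept
    return 1.0 / (1.0 + math.exp(-z))


# Each extra unit of glucose multiplies the odds by the same factor
odds_ratio = math.exp(coefficient)
print(f"Odds ratio per unit of glucose: {odds_ratio:.3f}")

# The change in probability for the same +10 step depends on where you start
for start in (60, 120, 200):
    change = probability(start + 10) - probability(start)
    print(f"{start} -> {start + 10}: probability changes by {change:+.3f}")
```

Under these assumed values, the same 10-point increase in glucose moves the probability most in the middle of the curve and far less at either end, even though the odds are multiplied by the same factor each time.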
Logistic regression scales to multiple features the same way linear regression does:
X = df[["glucose", "bmi", "age"]]
Each feature gets its own coefficient. The model computes a weighted sum, passes it through the sigmoid, and returns a probability. The code doesn’t change.
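The weighted-sum-then-sigmoid step can be written out by hand for a single patient. The coefficients, intercept, and patient values here are hypothetical, and `predict_probability` is an illustrative helper rather than part of scikit-learn.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Weighted sum of the features, passed through the sigmoid."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for glucose, bmi, age (not fitted to real data)
coefficients = [0.04, 0.08, 0.02]
intercept = -9.0

patient = [150, 31.0, 52]  # glucose, bmi, age
p = predict_probability(patient, coefficients, intercept)

print(f"Probability: {p:.3f}")
print("Prediction:", 1 if p >= 0.5 else 0)
```

A fitted model does this same computation for every row when you call `.predict()`, using the coefficients it learned during `.fit()`.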
| | Linear Regression | Logistic Regression |
|---|---|---|
| Output | A number | A probability (0–1) |
| Final step | None | Sigmoid + threshold |
| Task | Regression | Classification |
| Metric | MAE | Accuracy (with caveats) |
The underlying mechanics are nearly identical. That’s intentional — logistic regression is the natural bridge from regression to classification, and understanding both prepares you for algorithms where the internals are far less transparent.
Logistic regression takes a linear combination of features and passes it through a sigmoid function to produce a probability. A threshold converts that probability into a class prediction.
It’s fast, interpretable, and a strong baseline for binary classification. Like linear regression, if a more complex model can’t outperform it, that’s a signal worth investigating.
In the next unit you’ll meet Decision Trees properly — now with the vocabulary to understand what they’re actually doing.