How Git works under the hood

Every developer uses Git. Most use it as a sequence of commands — git add, git commit, git push — without a clear mental model of what’s actually happening on disk. That’s fine for everyday work, but when things go wrong (a botched rebase, a detached HEAD, a merge conflict that makes no sense), understanding the internals is what separates someone who can fix it from someone who has to nuke the repo and start again.

Git is not a file tracker — it’s a content-addressed database

The single most important thing to understand about Git is its data model. Git doesn’t track files. It tracks content. Specifically, it is a key-value store where every piece of data is stored under a key derived from a cryptographic hash of its content (SHA-1 by default; newer Git versions can also create repositories that use SHA-256).

Every time you commit, Git records a snapshot of every tracked file as it was staged, not a diff against the previous commit. If two files have identical content, they share the same object in Git’s store, regardless of their names. This is why Git is so space-efficient despite storing full snapshots: unchanged files from commit to commit are simply referenced again, not duplicated.
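
To make the content-addressing idea concrete, here is a minimal sketch of how a blob’s key can be computed for a SHA-1 repository. The function name git_blob_id is made up for this example; the one detail it adds to the description above is that Git hashes a short header ("blob", the size, and a NUL byte) together with the file’s bytes.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Two "files" with the same bytes map to the same key, whatever they are called.
readme = b"hello world\n"
copy_of_readme = b"hello world\n"

print(git_blob_id(readme))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(git_blob_id(readme) == git_blob_id(copy_of_readme))  # True
```

The filename never enters the hash, which is why identical content is stored once no matter how many paths refer to it.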

The four object types

Everything in a Git repository is one of four object types, all stored in .git/objects/:

Blob — the raw contents of a file. Nothing about the filename, permissions, or location. Just the bytes. Two files with the same content across your entire history share one blob.

Tree — a directory listing. It contains pointers to blobs (for files) and other trees (for subdirectories), along with filenames and permissions. This is where the filename and location information lives.

Commit — a snapshot of the repository at a point in time. It points to one tree (the root of the file tree at that moment), zero or more parent commits, and contains metadata: author, timestamp, and commit message.

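Tag — an annotated tag: an object that points to another object (almost always a commit) and carries its own metadata, such as the tagger, date, and message. Lightweight tags are not objects at all; like branches, they are plain pointers. Tags are how version numbers like v2.0.1 are stored.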

You can inspect any of these with git cat-file -p <hash> and see exactly what Git has stored.
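
As a rough illustration of what sits behind git cat-file, the sketch below writes and reads a "loose" object the way Git lays it out on disk: zlib-compressed, with the first two hex characters of the hash as a directory name. It works in a temporary directory rather than a real repository, the helper names are invented for this example, and it ignores packfiles, where most objects in a mature repository actually live.

```python
import hashlib
import os
import tempfile
import zlib

def write_loose_object(objects_dir: str, obj_type: str, body: bytes) -> str:
    """Store body as a loose object under objects_dir and return its ID."""
    store = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0" + body
    oid = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return oid

def read_loose_object(objects_dir: str, oid: str) -> tuple[str, bytes]:
    """Return (type, body) for a loose object, roughly what cat-file reports."""
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    header, _, body = store.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    assert int(size) == len(body)  # the header records the body length
    return obj_type, body

# Demo in a throwaway directory standing in for .git/objects/.
with tempfile.TemporaryDirectory() as objects_dir:
    oid = write_loose_object(objects_dir, "blob", b"hello world\n")
    print(oid)
    print(read_loose_object(objects_dir, oid))
```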

Commits form a graph of snapshots

A commit points to its parent commit. That parent points to its parent, and so on, all the way back to the initial commit, which has no parent. Because a merge commit has more than one parent, and several commits can share the same parent, the result is a directed acyclic graph (DAG) rather than a strictly linear sequence. That is what allows branching and merging.

When you look at git log, you’re traversing this graph backwards from your current commit.
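
The traversal can be sketched with a toy history held in a dictionary. The commit names and the PARENTS table are invented for illustration, and the breadth-first order used here is a simplification: by default the real git log orders commits by date rather than by distance from the starting point.

```python
from collections import deque

# Hypothetical history: each commit ID maps to its list of parent IDs.
PARENTS = {
    "a": [],          # initial commit, no parent
    "b": ["a"],
    "c": ["b"],       # made on main
    "d": ["b"],       # made on a feature branch
    "m": ["c", "d"],  # merge commit with two parents
}

def ancestors(start: str, parents: dict[str, list[str]]) -> list[str]:
    """Return every commit reachable from start, each visited once."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        commit = queue.popleft()
        order.append(commit)
        for parent in parents[commit]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order

print(ancestors("m", PARENTS))  # ['m', 'c', 'd', 'b', 'a']
print(ancestors("d", PARENTS))  # ['d', 'b', 'a']
```

Starting from the feature-branch commit "d", commit "c" never appears: it is simply not reachable by following parent pointers from there.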

Branches are just pointers

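This surprises many developers: a branch in Git is nothing more than a pointer to a commit, usually a small file containing that commit’s hash (40 hexadecimal characters for SHA-1). The file lives at .git/refs/heads/main (for the main branch), although Git may also pack many such pointers into a single .git/packed-refs file.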

When you make a new commit, Git creates the commit object, then updates the branch file to point to the new commit’s hash. Creating a new branch is instantaneous — Git just writes a new file with a hash in it. Deleting a branch just deletes that file. The commits themselves are unaffected.

HEAD is a special pointer that tells Git which branch you’re currently on. It typically contains the text ref: refs/heads/main. When you check out a branch, HEAD is updated to point at it. A “detached HEAD” state means HEAD points directly to a commit hash rather than to a branch name.
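
The following sketch shows how little machinery is involved. It builds a fake .git layout in a temporary directory and resolves HEAD by reading two small files. The helper names and the placeholder hash are assumptions for this example, packed refs are not handled, and hand-editing these files in a real repository is not recommended; git branch and git switch exist for that.

```python
import os
import tempfile
from typing import Optional

def resolve_head(git_dir: str) -> tuple[Optional[str], str]:
    """Return (branch name, commit hash); the branch is None when HEAD is detached."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if not head.startswith("ref: "):
        return None, head                  # detached: HEAD holds a hash directly
    ref = head[len("ref: "):]              # e.g. "refs/heads/main"
    with open(os.path.join(git_dir, ref)) as f:
        return ref[len("refs/heads/"):], f.read().strip()

def write_file(path: str, text: str) -> None:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text + "\n")

FAKE_COMMIT = "1" * 40  # placeholder hash, not a real commit

with tempfile.TemporaryDirectory() as git_dir:
    write_file(os.path.join(git_dir, "refs", "heads", "main"), FAKE_COMMIT)
    write_file(os.path.join(git_dir, "HEAD"), "ref: refs/heads/main")
    print(resolve_head(git_dir))   # on branch main

    # "Creating a branch" is writing one more small file with the same hash.
    write_file(os.path.join(git_dir, "refs", "heads", "feature"), FAKE_COMMIT)
    write_file(os.path.join(git_dir, "HEAD"), "ref: refs/heads/feature")
    print(resolve_head(git_dir))   # on branch feature, same commit

    # Detached HEAD: the file holds a hash instead of a ref name.
    write_file(os.path.join(git_dir, "HEAD"), FAKE_COMMIT)
    print(resolve_head(git_dir))   # no branch, just the commit
```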

The staging area (index)

Between your working directory and the commit history sits the index (also called the staging area). It’s a binary file at .git/index that represents the tree of the next commit you’re building.

When you run git add, Git hashes the file content, stores it as a blob object, and records that blob in the index with its filename and permissions. Nothing has been committed — you’ve just said “this version of this file belongs in my next snapshot.”

When you run git commit, Git reads the index, builds tree objects from it (one per directory), creates a new commit object pointing to the root tree, and updates the branch pointer. That’s the whole process.
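
Here is a simplified model of those two steps. The index is represented as a plain dictionary from path to blob ID, which is an assumption made for readability (the real .git/index is a binary file with more fields), and every file is given the ordinary mode 100644. The tree serialization itself (mode, name, NUL byte, raw hash, entries sorted by name) follows Git’s format for SHA-1 repositories.

```python
import hashlib

def object_id(obj_type: str, body: bytes) -> str:
    header = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def stage(index: dict, objects: dict, path: str, content: bytes) -> None:
    """Toy 'git add': store the blob, then record path -> blob ID in the index."""
    blob_id = object_id("blob", content)
    objects[blob_id] = content
    index[path] = blob_id

def write_tree(index: dict, objects: dict) -> str:
    """Toy tree builder: turn a flat index into nested trees; return the root ID."""
    files, subdirs = {}, {}
    for path, blob_id in index.items():
        name, sep, rest = path.partition("/")
        if sep:
            subdirs.setdefault(name, {})[rest] = blob_id  # belongs to a subdirectory
        else:
            files[name] = blob_id
    entries = [(name, b"100644", oid) for name, oid in files.items()]
    entries += [(name + "/", b"40000", write_tree(sub, objects))
                for name, sub in subdirs.items()]
    # Git sorts entries by name, comparing directories as if they ended in "/".
    entries.sort(key=lambda entry: entry[0].encode())
    body = b"".join(
        mode + b" " + name.rstrip("/").encode() + b"\0" + bytes.fromhex(oid)
        for name, mode, oid in entries
    )
    tree_id = object_id("tree", body)
    objects[tree_id] = body
    return tree_id

objects, index = {}, {}
stage(index, objects, "README.md", b"hello world\n")
stage(index, objects, "src/main.py", b"print('hi')\n")
stage(index, objects, "docs/main.py", b"print('hi')\n")

root = write_tree(index, objects)
print(root)
print(len(objects))  # 4: two blobs, one shared subtree, one root tree
```

The src and docs directories have identical contents, so they end up as the same tree object, the directory-level version of two files sharing one blob.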

Merging and rebasing: the real difference

Merging creates a new commit with two parent commits — one from each branch being merged. The history is preserved exactly as it happened. The downside is that in a busy repository, a merge-heavy history can become tangled and hard to read.

Rebasing takes the commits from one branch and replays them on top of another, one by one. Each replayed commit gets a new hash (since its parent has changed), so technically it’s a different commit. The result is a clean linear history — but you’ve rewritten it. This is why the golden rule is: never rebase commits that have already been pushed to a shared remote.
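
The reason a replayed commit cannot keep its hash is visible in the commit object itself: the parent’s hash is part of the bytes being hashed. The sketch below builds commit objects in Git’s text format with a fixed, made-up author and timestamp so that the parent is the only thing that varies. In a real rebase the tree and committer date usually change as well.

```python
import hashlib

# Assumed fixed identity and timestamp, so that only the parents differ.
WHO = "A U Thor <author@example.com> 1700000000 +0000"
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # ID of a tree with no entries

def commit_id(tree: str, parents: list[str], message: str) -> str:
    """Return the SHA-1 ID of a commit object with the given tree and parents."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {WHO}", f"committer {WHO}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

old_base = commit_id(EMPTY_TREE, [], "old base")
new_base = commit_id(EMPTY_TREE, [old_base], "new base")

# Same tree, same message, same author: only the parent differs.
original = commit_id(EMPTY_TREE, [old_base], "add feature")
replayed = commit_id(EMPTY_TREE, [new_base], "add feature")
print(original != replayed)  # True

# A merge, by contrast, leaves "original" untouched and adds a two-parent commit.
merge = commit_id(EMPTY_TREE, [new_base, original], "merge feature")
print(merge)
```

Anyone who already has the original commit now holds a hash that no longer appears in the rebased branch, which is the practical problem behind the golden rule.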

What git push actually does

When you push to a remote, Git sends only the objects the remote doesn’t already have, calculated by comparing the graph of commits you have against what the remote has. It then updates the remote branch pointer. This is why a push fails when the remote branch contains commits from another user that you don’t have yet: your branch pointer and the remote’s have diverged, and Git won’t overwrite remote history by default. You have to fetch and integrate those commits first (git pull does both), resolving the divergence locally.
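
A toy version of that decision, using commits only: work out what is reachable from each tip, refuse the push if the remote’s tip is not among your ancestors, and otherwise send the difference. The function names and the single shared parents table are simplifications for this example. The real protocol negotiates what each side has and transfers trees and blobs as well, bundled into a packfile.

```python
def reachable(tip: str, parents: dict[str, list[str]]) -> set[str]:
    """Return the set of commits reachable from tip by following parents."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen

def plan_push(local_tip: str, remote_tip: str, parents: dict[str, list[str]]) -> set[str]:
    """Return the commits to send, or raise if the push is not a fast-forward."""
    local = reachable(local_tip, parents)
    if remote_tip not in local:
        # The remote branch has moved to a commit that is not in our history.
        raise ValueError("rejected: non-fast-forward, integrate remote changes first")
    return local - reachable(remote_tip, parents)

# Hypothetical history: "c" is our new commit, "x" is a colleague's.
parents = {"a": [], "b": ["a"], "c": ["b"], "x": ["b"]}

print(sorted(plan_push("c", "b", parents)))  # ['c']

try:
    plan_push("c", "x", parents)  # the remote moved to "x" in the meantime
except ValueError as err:
    print(err)
```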

Why this matters

Understanding Git’s object model makes confusing situations legible. A “lost” commit isn’t gone: its object is still in .git/objects/ until garbage collection runs, and git reflog will show you where HEAD has pointed recently (reflog entries expire, by default after 30 to 90 days). Merge conflicts become less scary when you understand that Git is simply showing you two divergent sets of changes to the same part of a file and asking you to decide what the combined result should be. And rebasing stops feeling like magic (or dark arts) once you see it as just replaying commits onto a new base.

Git rewards understanding. The commands are a thin interface over a beautifully simple data model — and that model repays the time spent learning it.