How Git works under the hood

Every developer uses Git. Most use it as a sequence of commands — git add, git commit, git push — without a clear mental model of what’s actually happening on disk. That’s fine for everyday work, but when things go wrong (a botched rebase, a detached HEAD, a merge conflict that makes no sense), understanding the internals is what separates someone who can fix it from someone who has to nuke the repo and start again.

Git is not a file tracker — it’s a content-addressed database

The single most important thing to understand about Git is its data model. Git doesn’t track files. It tracks content. Specifically, it is a key-value store where every piece of data is stored under a key derived from a cryptographic hash of its content: SHA-1 by default, with SHA-256 available as an opt-in object format in newer Git versions.

Every time you commit, Git takes a snapshot of everything staged in the index (the full tree, not a diff against the previous commit) and stores it. If two files have identical content, they share the same object in Git’s store, regardless of their names. This is why Git is so space-efficient despite storing full snapshots: unchanged files from commit to commit are simply referenced again, not duplicated. Packfiles later add delta compression as a storage optimization, but the logical model remains full snapshots.
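
As a minimal sketch of content addressing, the snippet below computes the ID Git would give a blob in a SHA-1 repository: a hash over a short header ("blob", the size, a NUL byte) followed by the file's bytes. The function name git_blob_id is made up for this illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>" plus a NUL byte), then the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# Two "files" with different names but identical bytes map to one object ID,
# because the filename is never part of what gets hashed.
readme = b"hello\n"
copy_of_readme = b"hello\n"
same_id = git_blob_id(readme) == git_blob_id(copy_of_readme)
```

For a file containing the same bytes, the result should agree with what git hash-object prints in a SHA-1 repository.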

The four object types

Everything in a Git repository is one of four object types, all stored in .git/objects/:

Blob — the raw contents of a file. Nothing about the filename, permissions, or location. Just the bytes. Two files with the same content across your entire history share one blob.

Tree — a directory listing. It contains pointers to blobs (for files) and other trees (for subdirectories), along with filenames and permissions. This is where the filename and location information lives.

Commit — a snapshot of the repository at a point in time. It points to one tree (the root of the file tree at that moment), zero or more parent commits, and contains metadata: author, timestamp, and commit message.

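Tag — an annotated tag object: a named pointer to another object (usually a commit) that carries its own metadata, such as the tagger, date, and message. Version numbers like v2.0.1 are typically stored this way. Lightweight tags are not objects at all; they are plain refs under .git/refs/tags/ that point straight at a commit.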

You can inspect any of these with git cat-file -p <hash> and see exactly what Git has stored.
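
To make the on-disk layout concrete, here is a rough sketch of how a loose object could be written and read back. It assumes the loose-object format only (a zlib-compressed header plus body, stored under a two-character directory) and ignores packfiles entirely. The helper names write_object, read_object, and demo are hypothetical.

```python
import hashlib
import os
import tempfile
import zlib


def write_object(objects_dir, obj_type, body):
    """Store one loose object under objects_dir and return its SHA-1 ID."""
    header = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\x00"
    data = header + body
    oid = hashlib.sha1(data).hexdigest()
    # Loose objects live at objects/<first 2 hex chars>/<remaining 38 chars>.
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(data))
    return oid


def read_object(objects_dir, oid):
    """Inflate a loose object and split it into (type, body)."""
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    with open(path, "rb") as f:
        data = zlib.decompress(f.read())
    header, body = data.split(b"\x00", 1)
    obj_type, size = header.decode("ascii").split(" ")
    assert int(size) == len(body)  # the header records the body length
    return obj_type, body


def demo():
    """Round-trip one blob through a temporary objects directory."""
    with tempfile.TemporaryDirectory() as tmp:
        objects_dir = os.path.join(tmp, "objects")
        oid = write_object(objects_dir, "blob", b"hello\n")
        obj_type, body = read_object(objects_dir, oid)
        return oid, obj_type, body
```

Reading an object back this way is roughly the work git cat-file does before pretty-printing the body.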

Commits form a graph (of snapshots)

A commit points to its parent commit. That parent points to its parent, and so on, all the way back to the initial commit, which has no parent. Because a merge commit has two or more parents and several commits can share the same parent, the result is a directed acyclic graph (DAG), not a linear sequence, which is what allows branching and merging.

When you look at git log, you’re traversing this graph backwards from your current commit.
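
A toy version of that traversal, assuming a hypothetical five-commit history stored as a dictionary of parent links. Real git log orders its output by commit date by default; this sketch simply walks breadth-first and visits each commit once.

```python
from collections import deque

# Hypothetical history: each commit maps to the list of its parents.
# "a" is the root commit; "m" is a merge commit with two parents.
PARENTS = {
    "a": [],
    "b": ["a"],
    "c": ["b"],       # tip of main before the merge
    "d": ["b"],       # tip of a feature branch
    "m": ["c", "d"],  # merge of main and feature
}


def ancestors(start, parents):
    """Return every commit reachable from start, breadth-first, without repeats."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        commit = queue.popleft()
        order.append(commit)
        for parent in parents[commit]:
            if parent not in seen:  # "b" is reachable via both "c" and "d"
                seen.add(parent)
                queue.append(parent)
    return order
```

Calling ancestors("m", PARENTS) visits all five commits, while ancestors("c", PARENTS) never sees "d": the feature work is simply not reachable from that starting point.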

Branches are just pointers

This surprises many developers: a branch in Git is little more than a file containing a commit hash (40 hexadecimal characters for SHA-1). It’s a pointer to a commit. For the main branch, the file lives at .git/refs/heads/main, although Git may also pack many refs into a single .git/packed-refs file.

When you make a new commit, Git creates the commit object, then updates the branch file to point to the new commit’s hash. Creating a new branch is instantaneous — Git just writes a new file with a hash in it. Deleting a branch just deletes that file. The commits themselves are unaffected.

HEAD is a special pointer that tells Git which branch you’re currently on. It typically contains the text ref: refs/heads/main. When you check out a branch, HEAD is updated to point at it. A “detached HEAD” state means HEAD points directly to a commit hash rather than to a branch name.
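
The two HEAD states can be sketched as a small resolver. This is an illustration under simplifying assumptions: it reads only loose ref files, ignores .git/packed-refs, and runs against a throwaway directory that imitates the .git layout. The names resolve_head and demo are hypothetical.

```python
import os
import tempfile


def resolve_head(git_dir):
    """Return (ref_name_or_None, commit_hash) for a .git-style directory."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]  # e.g. refs/heads/main
        with open(os.path.join(git_dir, *ref.split("/"))) as f:
            return ref, f.read().strip()
    # Detached HEAD: the file holds a commit hash directly.
    return None, head


def demo():
    """Build a fake .git directory and resolve HEAD in both states."""
    commit = "1" * 40  # placeholder hash, not a real commit
    with tempfile.TemporaryDirectory() as git_dir:
        os.makedirs(os.path.join(git_dir, "refs", "heads"))
        with open(os.path.join(git_dir, "refs", "heads", "main"), "w") as f:
            f.write(commit + "\n")
        with open(os.path.join(git_dir, "HEAD"), "w") as f:
            f.write("ref: refs/heads/main\n")
        attached = resolve_head(git_dir)
        with open(os.path.join(git_dir, "HEAD"), "w") as f:
            f.write(commit + "\n")
        detached = resolve_head(git_dir)
    return attached, detached
```

In the attached case a new commit would move the branch file; in the detached case there is no branch file to move, which is why commits made there are easy to lose track of.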

The staging area (index)

Between your working directory and the commit history sits the index (also called the staging area). It’s a binary file at .git/index that represents the tree of the next commit you’re building.

When you run git add, Git hashes the file content, stores it as a blob object, and records that blob in the index with its filename and permissions. Nothing has been committed — you’ve just said “this version of this file belongs in my next snapshot.”

When you run git commit, Git reads the index, builds tree objects from it, creates a new commit object pointing to the root tree, and updates the branch pointer. That’s the whole process.
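
The commit step can be sketched in a few lines. The code below only computes object IDs (it stores nothing), models the index as a flat {path: blob_id} dictionary, and assumes every file is a regular non-executable file. The function names and the author identity are placeholders for illustration.

```python
import hashlib


def object_id(obj_type, body):
    """SHA-1 over Git's "<type> <size>" header, a NUL byte, and the body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def write_tree(index):
    """Turn a flat {path: blob_id} index into nested trees; return the root tree ID."""
    entries = {}  # name -> (mode, object_id) for this directory level
    subdirs = {}  # directory name -> flat index of everything beneath it
    for path, blob_id in index.items():
        name, _, rest = path.partition("/")
        if rest:
            subdirs.setdefault(name, {})[rest] = blob_id
        else:
            entries[name] = ("100644", blob_id)  # assumption: regular files only
    for name, sub_index in subdirs.items():
        entries[name] = ("40000", write_tree(sub_index))

    def sort_key(name):
        # Git orders entries bytewise, comparing directories as if they ended in "/".
        suffix = "/" if entries[name][0] == "40000" else ""
        return (name + suffix).encode()

    body = b""
    for name in sorted(entries, key=sort_key):
        mode, oid = entries[name]
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
    return object_id("tree", body)


def commit_id(tree_id, parent_ids, message,
              who="Example Author <author@example.com> 0 +0000"):
    """Serialize a commit that points at one root tree and zero or more parents."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines += [f"author {who}", f"committer {who}", "", message]
    return object_id("commit", ("\n".join(lines) + "\n").encode())


index = {
    "README.md": object_id("blob", b"hello\n"),
    "src/main.py": object_id("blob", b"print('hi')\n"),
}
root_tree = write_tree(index)
first_commit = commit_id(root_tree, [], "initial commit")
```

Changing a single nested file changes its blob ID, then the ID of the tree that contains it, then the root tree, and finally the commit, which is how one hash ends up identifying a whole snapshot.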

Merging and rebasing: the real difference

Merging creates a new commit with two parent commits — one from each branch being merged. The history is preserved exactly as it happened. The downside is that in a busy repository, a merge-heavy history can become tangled and hard to read.

Rebasing takes the commits from one branch and replays them on top of another, one by one. Each replayed commit gets a new hash (since its parent has changed), so technically it’s a different commit. The result is a clean linear history — but you’ve rewritten it. This is why the golden rule is: never rebase commits that have already been pushed to a shared remote.
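
A toy model shows why a replayed commit cannot keep its hash. The commit_id function here is a simplification: real commits also hash author, committer, and timestamps, and the tree names below are placeholders rather than real tree IDs.

```python
import hashlib


def commit_id(tree, parents, message):
    """Toy commit ID: a hash over the tree, the parent IDs, and the message."""
    text = f"tree {tree}\n"
    text += "".join(f"parent {p}\n" for p in parents)
    text += f"\n{message}\n"
    return hashlib.sha1(text.encode()).hexdigest()


base = commit_id("tree-0", [], "initial")
main_tip = commit_id("tree-1", [base], "main work")
feature = commit_id("tree-2", [base], "feature work")

# Merge: one new commit with two parents. The original feature commit
# keeps its ID and stays in the history.
merge = commit_id("tree-3", [main_tip, feature], "merge feature")

# Rebase: the same message replayed on top of main_tip. The parent is
# different, so the ID is different, and the old feature commit is left behind.
rebased = commit_id("tree-3", [main_tip], "feature work")
```

Anyone who already built on the old feature commit still refers to the old hash, which is the practical reason behind the golden rule.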

What git push actually does

When you push to a remote, Git sends only the objects the remote doesn’t already have, worked out by comparing the commit graph you have against what the remote has. It then asks the remote to update its branch pointer. This is why pushing fails when the remote has new commits from another user: your branch and the remote’s have diverged, the update would not be a fast-forward, and Git won’t overwrite remote history by default. You have to fetch those commits and merge or rebase locally first (git pull does both steps), then push again.
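
The push decision can be approximated with two set operations. This sketch assumes a single dictionary holding every commit on both sides, whereas real Git negotiates over the wire and also transfers trees and blobs. The graph and the function names are hypothetical.

```python
# Hypothetical combined history: "c" is your new local commit,
# "x" is a commit a colleague pushed to the remote in the meantime.
PARENTS = {
    "a": [],
    "b": ["a"],
    "c": ["b"],
    "x": ["b"],
}


def reachable(tip, parents):
    """Return the set of commits reachable from tip."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def plan_push(local_tip, remote_tip, parents):
    """Return (commits_to_send, is_fast_forward)."""
    ours = reachable(local_tip, parents)
    theirs = reachable(remote_tip, parents)
    # Fast-forward: the remote's current tip is already part of our history,
    # so moving its pointer to our tip loses nothing.
    return ours - theirs, remote_tip in ours
```

With the remote at "b", plan_push("c", "b", PARENTS) sends only "c" and is a fast-forward. With the remote at "x", the same push is not a fast-forward, which corresponds to the rejection described above.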

Why this matters

Understanding Git’s object model makes confusing situations legible. A “lost” commit isn’t gone: its object stays in .git/objects/ until garbage collection prunes it, and git reflog shows the recent positions of HEAD (entries expire eventually, by default after 30 to 90 days). Merge conflicts become less scary when you understand that Git is simply showing you two divergent sets of changes to the same part of a file and asking you to decide how to combine them. And rebasing stops feeling like magic (or dark arts) once you see it as just replaying commits onto a new base.

Git rewards understanding. The commands are a thin interface over a beautifully simple data model — and that model repays the time spent learning it.

How transformers actually work — the architecture behind every modern AI

You’ve probably used a transformer today. ChatGPT, Claude, Gemini, GitHub Copilot, Google Translate — every one of them is built on the same foundational architecture, first described in a 2017 paper titled “Attention Is All You Need.” It’s one of the most influential pieces of computer science research in decades. But what does a transformer actually do?

The problem transformers solved

Before transformers, the dominant approach to processing language was recurrent neural networks (RNNs). An RNN reads a sentence word by word, left to right, maintaining a kind of rolling memory of what it’s seen so far. The problem is that this memory fades. By the time an RNN reaches the end of a long paragraph, the information from the beginning has been diluted through dozens of processing steps. Long-range dependencies — like connecting a pronoun at the end of a sentence to the noun it refers to at the beginning — were genuinely difficult.

Transformers abandoned the sequential, word-by-word approach entirely.

Attention: looking at everything at once

The core mechanism of a transformer is called self-attention. Instead of processing words in order, the model looks at every word in the input simultaneously and figures out which words should pay attention to which other words.

Take the sentence: “The animal didn’t cross the street because it was too tired.”

What does “it” refer to — the animal or the street? For a human, this is obvious. For a machine, it requires connecting the pronoun to a noun that might be many positions away. Self-attention solves this by computing, for every word, a score of how relevant every other word in the sentence is to understanding it. “It” ends up attending strongly to “animal” and weakly to “street.”

These attention scores are calculated from three vectors that the model computes for each word using learned projection matrices: a Query (what am I looking for?), a Key (what do I offer?), and a Value (what information do I actually contain?). The similarity between a Query and every Key in the sequence, measured as a scaled dot product and normalized with softmax, determines how much attention is paid to each word’s Value.
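
A minimal pure-Python sketch of that computation, using scaled dot-product attention as in the original paper. The two-dimensional vectors are hand-picked for illustration, not learned; in a real model the queries, keys, and values come from multiplying each word's embedding by learned matrices.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical vectors for three words of the example sentence.
tokens = ["animal", "street", "it"]
keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_it = [2.0, 0.0]

outputs, weights = attention([query_for_it], keys, values)
```

With these made-up numbers the query for "it" lines up with the key for "animal", so "animal" receives the largest weight and "street" the smallest.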

Multiple heads, multiple perspectives

A transformer doesn’t run attention just once — it runs it several times in parallel, each with different learned weights. These are called attention heads. One head might learn to track grammatical agreement between subjects and verbs. Another might learn to connect pronouns to their referents. Another might focus on sentiment. Each head sees the same input but learns to look for something different. Their outputs are combined.

This is called multi-head attention, and it’s what gives transformers such rich, nuanced representations of language.
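
Multi-head attention can be sketched as several independent attention computations whose outputs are concatenated and mixed by an output projection. The weights below are random rather than learned, and the sizes (3 positions, model width 4, 2 heads of width 2) are arbitrary choices for the illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply an n x m matrix by an m x p matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_attention(x, heads, w_out):
    """x: seq_len x d_model. heads: list of (w_q, w_k, w_v), each d_model x d_head."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the heads position by position, then mix them with w_out.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(x))]
    return matmul(concat, w_out)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def demo(seed=0):
    rng = random.Random(seed)
    seq_len, d_model, n_heads, d_head = 3, 4, 2, 2
    x = random_matrix(seq_len, d_model, rng)
    heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
             for _ in range(n_heads)]
    w_out = random_matrix(n_heads * d_head, d_model, rng)
    return multi_head_attention(x, heads, w_out)
```

Each head has its own query, key, and value projections, which is what lets different heads attend to different relationships in the same input.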

Layers, feedforward networks, and depth

After the attention mechanism, each position in the sequence passes through a simple feedforward neural network — the same network applied to every position independently. Then the output feeds into the next layer, which runs attention again, and then another feedforward network, and so on.
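
The "same network applied to every position independently" part can be shown with a tiny two-layer network. This sketch uses a ReLU activation and hand-picked weights, and it leaves out the residual connections and layer normalization that real transformer blocks wrap around each sub-layer.

```python
def feed_forward(vec, w1, b1, w2, b2):
    """Two-layer network with ReLU, applied to a single position's vector."""
    hidden = [max(0.0, sum(v * w for v, w in zip(vec, col)) + b)
              for col, b in zip(zip(*w1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*w2), b2)]


def apply_per_position(seq, w1, b1, w2, b2):
    """Run the same network on each position; positions never interact here."""
    return [feed_forward(vec, w1, b1, w2, b2) for vec in seq]


# Hand-picked weights for illustration (2 inputs, 2 hidden units, 2 outputs).
W1 = [[1.0, -1.0], [0.0, 1.0]]
B1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
B2 = [0.0, 0.0]

sequence = [[1.0, 2.0], [3.0, 1.0]]
result = apply_per_position(sequence, W1, B1, W2, B2)
```

Because the network sees one position at a time, all mixing of information between words happens in the attention step, not here.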

Modern large language models stack dozens of these layers, and sometimes more than a hundred. Unconfirmed reports put GPT-4 at around 120, though OpenAI has not published its architecture. Each layer refines the representation: early layers tend to capture surface-level patterns, while deeper layers tend to capture abstract reasoning and semantic relationships.

Positional encoding: putting words back in order

Since transformers process all positions simultaneously, they don’t inherently know that word 3 comes before word 7. To preserve order information, a positional encoding is added to each word’s representation before processing begins — a mathematical signal that encodes position. This is how the model knows “the dog bit the man” and “the man bit the dog” are different, even though they contain the same words.
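
One concrete choice is the sinusoidal encoding from the original paper, sketched below. The four-dimensional word embeddings are invented for the example; many newer models use other position schemes.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
            angle = pos / (10000 ** ((j - j % 2) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Hypothetical embeddings, only to show the effect of adding positions.
EMBED = {
    "the": [0.1, 0.2, 0.3, 0.4],
    "dog": [1.0, 0.0, 0.0, 0.0],
    "bit": [0.0, 1.0, 0.0, 0.0],
    "man": [0.0, 0.0, 1.0, 0.0],
}


def encode(sentence):
    """Look up each word's embedding and add the encoding for its position."""
    words = sentence.split()
    table = positional_encoding(len(words), 4)
    return [[e + p for e, p in zip(EMBED[w], row)] for w, row in zip(words, table)]


dog_first = encode("the dog bit the man")
man_first = encode("the man bit the dog")
```

Both sentences contain the same words, but "dog" at position 1 and "dog" at position 4 end up with different vectors, so the model receives two distinguishable inputs.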

Why transformers scale so well

One of the most important properties of transformers, and the reason they’ve come to dominate AI, is that they scale predictably. The more parameters, training data, and compute you throw at them, the lower their loss gets, following a remarkably smooth power-law trend. This observation, usually referred to as scaling laws, was documented by OpenAI researchers in 2020 and triggered an arms race in model size that’s still ongoing.

RNNs never showed this property clearly. Transformers did, and that made them the architecture of choice for any serious language model — and increasingly for vision, audio, and multimodal systems as well.

The limits

Transformers aren’t without problems. The attention mechanism’s computational cost scales quadratically with sequence length — doubling the input roughly quadruples the compute required. This makes very long context windows expensive, which is why enormous effort has gone into alternatives like sparse attention, sliding window attention, and state-space models (like Mamba) that try to preserve transformer-level quality at lower cost for long sequences.
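
The quadratic growth is easy to check with a little arithmetic. The sketch below counts only query-key score computations, ignores the vector width and the feedforward layers, and models sliding window attention in a causal form where each position looks at itself and the positions before it, up to a fixed window. Both function names are made up for the example.

```python
def full_attention_scores(seq_len):
    """Every position scores every position: seq_len squared pairs."""
    return seq_len * seq_len


def sliding_window_scores(seq_len, window):
    """Each position scores at most `window` positions (itself and earlier ones)."""
    return sum(min(window, pos + 1) for pos in range(seq_len))


ratio_when_doubled = full_attention_scores(2048) / full_attention_scores(1024)
windowed_small = sliding_window_scores(8, 3)
```

Under these assumptions doubling the sequence length multiplies the full attention count by four, while the windowed count grows roughly linearly with length once the sequence is longer than the window.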

There’s also the question of whether attention is really “understanding” in any meaningful sense, or a sophisticated pattern-matching system that has learned to fake it at scale. That debate continues — and it’s one of the most interesting open questions in AI research today.

Further reading

  • Vaswani et al., “Attention Is All You Need” (2017) — the original paper
  • Kaplan et al., “Scaling Laws for Neural Language Models” (2020)
  • Illustrated Transformer by Jay Alammar — an excellent visual walkthrough