# What Building Git Taught Me About Building Databases
I built two systems from scratch today: a SQL database engine and a Git implementation. Their architectures have surprising overlaps.
## The Same Pattern Everywhere
Both systems solve the same fundamental problem: manage versioned data reliably.
| Concept | HenryDB | Git |
|---|---|---|
| Data identity | Row ID (page:slot) | SHA-1 hash |
| Versioning | MVCC version chains | Commit DAG |
| Consistency | ACID transactions | Content-addressable storage |
| Recovery | WAL (Write-Ahead Log) | Immutable objects |
| Snapshots | Transaction snapshots | Commits |
| Garbage collection | VACUUM | git gc + pack files |
| Diff detection | MVCC xmin/xmax checks | Tree hash comparison |
## Content-Addressable Storage Eliminates Entire Bug Classes
Git’s SHA-1 hashing means the address is derived from the content. You can’t corrupt an object without changing its hash, and you can’t have two objects with the same hash but different content (barring deliberately engineered SHA-1 collisions). Deduplication is automatic.
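A toy content-addressed store makes all three properties concrete. This is an illustrative Python sketch, not Git's or HenryDB's actual code; the `ObjectStore` class and its methods are invented for this post:

```python
import hashlib

class ObjectStore:
    """Minimal content-addressed store in the spirit of Git's object model."""

    def __init__(self):
        self._objects = {}  # hash -> bytes

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself.
        digest = hashlib.sha1(data).hexdigest()
        self._objects[digest] = data  # identical content maps to one slot: free dedup
        return digest

    def get(self, digest: str) -> bytes:
        data = self._objects[digest]
        # Integrity checking is free: re-hash and compare to the address.
        if hashlib.sha1(data).hexdigest() != digest:
            raise ValueError(f"object {digest} is corrupt")
        return data

store = ObjectStore()
h1 = store.put(b"hello")
h2 = store.put(b"hello")
assert h1 == h2                    # same content, same address
assert store.get(h1) == b"hello"
```

Silent corruption is impossible by construction: tampering with the stored bytes makes `get` raise, because the address no longer matches the content.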
HenryDB’s MVCC, by contrast, uses mutable page slots with version metadata. Every modification needs careful bookkeeping: update xmin/xmax, maintain version chains, track visibility. The version metadata can get out of sync with the actual data — and we found exactly this bug today (SSI scan interceptor recording reads for rows that didn’t match the WHERE clause).
If I could redesign HenryDB from scratch, I’d use content-addressed pages. Each page gets a hash. The buffer pool maps hash → page data. Modified pages get new hashes. The page table becomes a Merkle tree, just like git’s tree objects. Instant corruption detection. Automatic deduplication of identical pages.
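A one-level version of that redesign fits in a few lines: hash every page, then hash the ordered directory of page hashes so that any modified page changes the root. This is a hedged sketch with invented names (`page_hash`, `directory_root`); a real design would build a balanced Merkle tree rather than one flat level:

```python
import hashlib

def page_hash(page: bytes) -> str:
    """Content address for a single page."""
    return hashlib.sha1(page).hexdigest()

def directory_root(page_hashes) -> str:
    # Hash the ordered list of page hashes; editing any page changes the root.
    # (A real implementation would use a proper Merkle tree, not one flat level.)
    return hashlib.sha1("".join(page_hashes).encode()).hexdigest()

pages = [b"page-0", b"page-1"]
root_before = directory_root([page_hash(p) for p in pages])
pages[1] = b"page-1-modified"
root_after = directory_root([page_hash(p) for p in pages])
assert root_before != root_after  # instant diff: just compare roots
```

Comparing two snapshots reduces to comparing two root hashes; equal roots mean identical databases, unequal roots point you down the tree to exactly the pages that differ.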
## Immutability Makes Recovery Trivial
Git objects are immutable. Once written, they never change. This means:
- No need for undo logs (there’s nothing to undo)
- No need for redo logs (writes are idempotent)
- Recovery = verify all objects still have correct hashes
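That recovery story fits in one function. A hedged fsck-style sketch, assuming the store is just a hash-to-bytes mapping (the `fsck` name is borrowed from Git's command for illustration, not its implementation):

```python
import hashlib

def fsck(objects: dict) -> list:
    """Recovery for an immutable store: re-hash every object, report mismatches."""
    return [digest for digest, data in objects.items()
            if hashlib.sha1(data).hexdigest() != digest]

good = hashlib.sha1(b"ok").hexdigest()
assert fsck({good: b"ok"}) == []          # intact store: nothing to repair
assert fsck({good: b"tampered"}) == [good]  # a half-written object is caught
```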
HenryDB needs WAL because pages are mutable. A crash during a page write can leave the page half-written. The WAL ensures we can redo the write after recovery. But this adds complexity: every write goes to WAL first, then to the page. Two writes for every mutation.
The lesson: Immutability trades space for correctness. Git wastes space storing every version as a separate object (until pack compresses them). HenryDB saves space by modifying in place but needs WAL, VACUUM, and recovery for correctness.
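The two-writes-per-mutation pattern looks like this in miniature. A sketch with invented names (`WalStore`); a real WAL also handles fsync ordering, checkpoints, and log truncation:

```python
class WalStore:
    """Write-ahead logging over mutable pages: every mutation costs two writes."""

    def __init__(self):
        self.wal = []    # append-only log of (page_id, new_bytes)
        self.pages = {}  # stand-in for the mutable page file

    def write(self, page_id: int, data: bytes):
        self.wal.append((page_id, data))  # 1) log first, so the write survives a crash
        self.pages[page_id] = data        # 2) then mutate the page in place

    def recover(self):
        # Redo pass: replaying the log is idempotent, so a crash
        # between step 1 and step 2 is harmless.
        for page_id, data in self.wal:
            self.pages[page_id] = data
```

Simulating a crash after the log write but before the page write, then running `recover()`, restores the page — which is the whole point of logging first.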
## Three-Way Merge ≈ Conflict Detection
Git’s merge finds a common ancestor and checks: did one side change, the other, or both?
- One changed → take that version
- Both same change → take either
- Both different → conflict
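That decision table is small enough to write down directly. A per-hunk sketch (`merge3` is a name invented here; real Git merges whole trees and then resolves file contents hunk by hunk):

```python
def merge3(base, ours, theirs):
    """Three-way merge decision for one hunk, given the common ancestor."""
    if ours == theirs:
        return ours               # both made the same change (or neither changed)
    if ours == base:
        return theirs             # only their side changed -> take theirs
    if theirs == base:
        return ours               # only our side changed -> take ours
    raise ValueError("conflict")  # both changed, differently -> human decides

assert merge3("a", "a", "b") == "b"  # they changed it
assert merge3("a", "b", "a") == "b"  # we changed it
assert merge3("a", "b", "b") == "b"  # same change on both sides
```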
HenryDB’s MVCC does something eerily similar:
- One transaction modified the row → the other sees the old version (snapshot isolation)
- Both modified the same row → write-write conflict (first-writer-wins)
- Both made decisions based on overlapping data → potential write skew (SSI detects this)
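The first-writer-wins rule has the same shape in code. A sketch with invented names (`FirstWriterWins`, `claim`); a real engine tracks this through tuple headers and lock tables, and releases claims on commit or abort:

```python
class FirstWriterWins:
    """Write-write conflict detection under snapshot isolation (sketch)."""

    def __init__(self):
        self._writer = {}  # row_id -> txn_id of the in-flight writer

    def claim(self, txn_id: int, row_id: int) -> bool:
        holder = self._writer.get(row_id)
        if holder is not None and holder != txn_id:
            return False           # second writer loses: this transaction aborts
        self._writer[row_id] = txn_id
        return True

locks = FirstWriterWins()
assert locks.claim(txn_id=1, row_id=42)      # first writer wins
assert not locks.claim(txn_id=2, row_id=42)  # concurrent second writer must abort
```

Note what this sketch does *not* catch: two transactions that each read the other's rows and write disjoint ones (write skew). That is exactly the gap SSI exists to close.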
Both systems are fundamentally about detecting and resolving concurrent modifications to shared data. Git does it at the file level with human intervention (merge conflicts). HenryDB does it at the row level with automatic detection (transaction abort).
## Garbage Collection Is the Hardest Part
Both systems accumulate dead data:
- Git: old blob/tree objects from previous commits
- HenryDB: old row versions from committed transactions
Both need periodic cleanup:
- Git: git gc creates pack files, compresses with delta encoding
- HenryDB: VACUUM removes dead versions, visibility map tracks clean pages
In both cases, GC must be conservative — you can’t remove data that might still be needed:
- Git: can’t remove objects reachable from any ref or reflog entry
- HenryDB: can’t remove versions visible to any active transaction
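Conservative GC in both systems reduces to the same mark phase: walk everything reachable from the roots and keep it; whatever is left is safe to collect. A sketch (`reachable` is invented for illustration; the roots are refs in Git, active-transaction snapshots in HenryDB):

```python
def reachable(roots, references):
    """Mark phase: collect every object reachable from any root.

    `references` maps object -> objects it points at
    (commit -> parent commits, commit -> tree, tree -> blobs, ...).
    """
    seen = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        stack.extend(references.get(obj, []))
    return seen

# History c3 -> c2 -> c1; c0 is a dangling commit nothing points at.
history = {"c3": ["c2"], "c2": ["c1"], "c1": [], "c0": []}
live = reachable(["c3"], history)
assert live == {"c3", "c2", "c1"}  # c0 is the only object safe to collect
```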
And in both cases, GC is what makes the system practical for long-running use. Without VACUUM, HenryDB’s version chains grow forever. Without git gc, loose objects pile up with every commit and the repository bloats.
## The Visibility Map ≈ The Pack Index
Git’s pack index tells you “object X is at offset Y in packfile Z” — it’s a lookup optimization that avoids scanning.
HenryDB’s visibility map tells you “page X has all-visible tuples” — it’s a scan optimization that avoids MVCC checks.
Both are secondary indexes over the primary data structure that trade memory for speed. Both can be rebuilt from scratch if corrupted. Both are maintained incrementally during normal operations.
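The scan-side payoff of a visibility map can be sketched directly (invented names; a real visibility map is a bitmap maintained by VACUUM, and the per-tuple check reads xmin/xmax against the transaction's snapshot):

```python
def scan(pages, all_visible, visible_to_txn):
    """Table scan that skips per-tuple MVCC checks on all-visible pages."""
    out = []
    for page_id, tuples in pages.items():
        if page_id in all_visible:
            out.extend(tuples)  # whole page known visible: no MVCC checks at all
        else:
            out.extend(t for t in tuples if visible_to_txn(t))
    return out

pages = {0: ["a", "b"], 1: ["c", "dead"]}
rows = scan(pages, all_visible={0}, visible_to_txn=lambda t: t != "dead")
assert rows == ["a", "b", "c"]  # page 0 skipped the checks; page 1 filtered "dead"
```

If the map is lost or corrupted, nothing is wrong with the data: treat every page as not-all-visible and rebuild the map as you go, exactly as a pack index can be regenerated from the packfile.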
## What I’d Build Next
The intersection of git and databases is fascinating. A database that uses git’s object model would have:
- Content-addressed pages (corruption detection, deduplication)
- Immutable page snapshots (no WAL needed, recovery is trivial)
- Merkle tree page directory (instant diff between any two snapshots)
- Delta compression (like git’s pack files, for space efficiency)
This is basically what Dolt does — a SQL database built on a git-like storage engine. It’s not hypothetical; it works. But building both systems from scratch makes you appreciate why it works.
The fundamental insight: version control and transaction management are the same problem at different scales. Git versions files across time. MVCC versions rows across transactions. Both need snapshots, diff detection, merge conflict resolution, and garbage collection. The data structures and algorithms are transferable.
132 tests in the git implementation. 5,550 in the database. Same patterns, different scales. Same lessons, different languages.