# What Building Git Taught Me About Building Databases
I built two systems from scratch today: a SQL database engine and a Git implementation. Their architectures have surprising overlaps.
## The Same Pattern Everywhere
Both systems solve the same fundamental problem: manage versioned data reliably.
| Concept | HenryDB | Git |
|---|---|---|
| Data identity | Row ID (page:slot) | SHA-1 hash |
| Versioning | MVCC version chains | Commit DAG |
| Consistency | ACID transactions | Content-addressable storage |
| Recovery | WAL (Write-Ahead Log) | Immutable objects |
| Snapshots | Transaction snapshots | Commits |
| Garbage collection | VACUUM | git gc + pack files |
| Diff detection | MVCC xmin/xmax checks | Tree hash comparison |
## Content-Addressable Storage Eliminates Entire Bug Classes
Git’s SHA-1 hashing means the address is derived from the content. You can’t corrupt an object without changing its hash, and you can’t have two objects with the same hash but different content (barring deliberately engineered SHA-1 collisions). Deduplication is automatic.
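A toy content-addressed store makes all three properties concrete. This is an illustrative Python sketch, not Git's or HenryDB's actual code; the `ObjectStore` class and its methods are invented for this post:

```python
import hashlib

class ObjectStore:
    """Minimal content-addressed store in the spirit of Git's object model."""

    def __init__(self):
        self._objects = {}  # hash -> bytes

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself.
        digest = hashlib.sha1(data).hexdigest()
        self._objects[digest] = data  # identical content maps to one slot: free dedup
        return digest

    def get(self, digest: str) -> bytes:
        data = self._objects[digest]
        # Integrity checking is free: re-hash and compare to the address.
        if hashlib.sha1(data).hexdigest() != digest:
            raise ValueError(f"object {digest} is corrupt")
        return data

store = ObjectStore()
h1 = store.put(b"hello")
h2 = store.put(b"hello")
assert h1 == h2                    # same content, same address
assert store.get(h1) == b"hello"
```

Silent corruption is impossible by construction: tampering with the stored bytes makes `get` raise, because the address no longer matches the content.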
HenryDB’s MVCC, by contrast, uses mutable page slots with version metadata. Every modification needs careful bookkeeping: update xmin/xmax, maintain version chains, track visibility. The version metadata can get out of sync with the actual data — and we found exactly this bug today (SSI scan interceptor recording reads for rows that didn’t match the WHERE clause).
If I could redesign HenryDB from scratch, I’d use content-addressed pages. Each page gets a hash. The buffer pool maps hash → page data. Modified pages get new hashes. The page table becomes a Merkle tree, just like git’s tree objects. Instant corruption detection. Automatic deduplication of identical pages.
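A one-level version of that redesign fits in a few lines: hash every page, then hash the ordered directory of page hashes so that any modified page changes the root. This is a hedged sketch with invented names (`page_hash`, `directory_root`); a real design would build a balanced Merkle tree rather than one flat level:

```python
import hashlib

def page_hash(page: bytes) -> str:
    """Content address for a single page."""
    return hashlib.sha1(page).hexdigest()

def directory_root(page_hashes) -> str:
    # Hash the ordered list of page hashes; editing any page changes the root.
    # (A real implementation would use a proper Merkle tree, not one flat level.)
    return hashlib.sha1("".join(page_hashes).encode()).hexdigest()

pages = [b"page-0", b"page-1"]
root_before = directory_root([page_hash(p) for p in pages])
pages[1] = b"page-1-modified"
root_after = directory_root([page_hash(p) for p in pages])
assert root_before != root_after  # instant diff: just compare roots
```

Comparing two snapshots reduces to comparing two root hashes; equal roots mean identical databases, unequal roots point you down the tree to exactly the pages that differ.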
## Immutability Makes Recovery Trivial
Git objects are immutable. Once written, they never change. This means:
- No need for undo logs (there’s nothing to undo)
- No need for redo logs (writes are idempotent)
- Recovery = verify all objects still have correct hashes
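That recovery story fits in one function. A hedged fsck-style sketch, assuming the store is just a hash-to-bytes mapping (the `fsck` name is borrowed from Git's command for illustration, not its implementation):

```python
import hashlib

def fsck(objects: dict) -> list:
    """Recovery for an immutable store: re-hash every object, report mismatches."""
    return [digest for digest, data in objects.items()
            if hashlib.sha1(data).hexdigest() != digest]

good = hashlib.sha1(b"ok").hexdigest()
assert fsck({good: b"ok"}) == []          # intact store: nothing to repair
assert fsck({good: b"tampered"}) == [good]  # a half-written object is caught
```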
HenryDB needs WAL because pages are mutable. A crash during a page write can leave the page half-written. The WAL ensures we can redo the write after recovery. But this adds complexity: every write goes to WAL first, then to the page. Two writes for every mutation.
The lesson: Immutability trades space for correctness. Git wastes space storing every version as a separate object (until pack compresses them). HenryDB saves space by modifying in place but needs WAL, VACUUM, and recovery for correctness.
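The two-writes-per-mutation pattern looks like this in miniature. A sketch with invented names (`WalStore`); a real WAL also handles fsync ordering, checkpoints, and log truncation:

```python
class WalStore:
    """Write-ahead logging over mutable pages: every mutation costs two writes."""

    def __init__(self):
        self.wal = []    # append-only log of (page_id, new_bytes)
        self.pages = {}  # stand-in for the mutable page file

    def write(self, page_id: int, data: bytes):
        self.wal.append((page_id, data))  # 1) log first, so the write survives a crash
        self.pages[page_id] = data        # 2) then mutate the page in place

    def recover(self):
        # Redo pass: replaying the log is idempotent, so a crash
        # between step 1 and step 2 is harmless.
        for page_id, data in self.wal:
            self.pages[page_id] = data
```

Simulating a crash after the log write but before the page write, then running `recover()`, restores the page — which is the whole point of logging first.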
## Three-Way Merge ≈ Conflict Detection
Git’s merge finds a common ancestor and checks: did one side change, the other, or both?
- One changed → take that version
- Both same change → take either
- Both different → conflict
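That decision table is small enough to write down directly. A per-hunk sketch (`merge3` is a name invented here; real Git merges whole trees and then resolves file contents hunk by hunk):

```python
def merge3(base, ours, theirs):
    """Three-way merge decision for one hunk, given the common ancestor."""
    if ours == theirs:
        return ours               # both made the same change (or neither changed)
    if ours == base:
        return theirs             # only their side changed -> take theirs
    if theirs == base:
        return ours               # only our side changed -> take ours
    raise ValueError("conflict")  # both changed, differently -> human decides

assert merge3("a", "a", "b") == "b"  # they changed it
assert merge3("a", "b", "a") == "b"  # we changed it
assert merge3("a", "b", "b") == "b"  # same change on both sides
```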
HenryDB’s MVCC does something eerily similar:
- One transaction modified the row → the other sees the old version (snapshot isolation)
- Both modified the same row → write-write conflict (first-writer-wins)
- Both made decisions based on overlapping data → potential write skew (SSI detects this)
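The first-writer-wins rule has the same shape in code. A sketch with invented names (`FirstWriterWins`, `claim`); a real engine tracks this through tuple headers and lock tables, and releases claims on commit or abort:

```python
class FirstWriterWins:
    """Write-write conflict detection under snapshot isolation (sketch)."""

    def __init__(self):
        self._writer = {}  # row_id -> txn_id of the in-flight writer

    def claim(self, txn_id: int, row_id: int) -> bool:
        holder = self._writer.get(row_id)
        if holder is not None and holder != txn_id:
            return False           # second writer loses: this transaction aborts
        self._writer[row_id] = txn_id
        return True

locks = FirstWriterWins()
assert locks.claim(txn_id=1, row_id=42)      # first writer wins
assert not locks.claim(txn_id=2, row_id=42)  # concurrent second writer must abort
```

Note what this sketch does *not* catch: two transactions that each read the other's rows and write disjoint ones (write skew). That is exactly the gap SSI exists to close.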
Both systems are fundamentally about detecting and resolving concurrent modifications to shared data. Git does it at the file level with human intervention (merge conflicts). HenryDB does it at the row level with automatic detection (transaction abort).
## Garbage Collection Is the Hardest Part
Both systems accumulate dead data:
- Git: old blob/tree objects from previous commits
- HenryDB: old row versions from committed transactions
Both need periodic cleanup:
- Git: git gc creates pack files, compresses with delta encoding
- HenryDB: VACUUM removes dead versions, visibility map tracks clean pages
In both cases, GC must be conservative — you can’t remove data that might still be needed:
- Git: can’t remove objects reachable from any ref or reflog entry
- HenryDB: can’t remove versions visible to any active transaction
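Conservative GC in both systems reduces to the same mark phase: walk everything reachable from the roots and keep it; whatever is left is safe to collect. A sketch (`reachable` is invented for illustration; the roots are refs in Git, active-transaction snapshots in HenryDB):

```python
def reachable(roots, references):
    """Mark phase: collect every object reachable from any root.

    `references` maps object -> objects it points at
    (commit -> parent commits, commit -> tree, tree -> blobs, ...).
    """
    seen = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        stack.extend(references.get(obj, []))
    return seen

# History c3 -> c2 -> c1; c0 is a dangling commit nothing points at.
history = {"c3": ["c2"], "c2": ["c1"], "c1": [], "c0": []}
live = reachable(["c3"], history)
assert live == {"c3", "c2", "c1"}  # c0 is the only object safe to collect
```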
And in both cases, GC is what makes the system practical for long-running use. Without VACUUM, HenryDB’s version chains grow forever. Without git gc, loose objects pile up with every commit and the repository bloats.
## The Visibility Map ≈ The Pack Index
Git’s pack index tells you “object X is at offset Y in packfile Z” — it’s a lookup optimization that avoids scanning.
HenryDB’s visibility map tells you “page X has all-visible tuples” — it’s a scan optimization that avoids MVCC checks.
Both are secondary indexes over the primary data structure that trade memory for speed. Both can be rebuilt from scratch if corrupted. Both are maintained incrementally during normal operations.
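The scan-side payoff of a visibility map can be sketched directly (invented names; a real visibility map is a bitmap maintained by VACUUM, and the per-tuple check reads xmin/xmax against the transaction's snapshot):

```python
def scan(pages, all_visible, visible_to_txn):
    """Table scan that skips per-tuple MVCC checks on all-visible pages."""
    out = []
    for page_id, tuples in pages.items():
        if page_id in all_visible:
            out.extend(tuples)  # whole page known visible: no MVCC checks at all
        else:
            out.extend(t for t in tuples if visible_to_txn(t))
    return out

pages = {0: ["a", "b"], 1: ["c", "dead"]}
rows = scan(pages, all_visible={0}, visible_to_txn=lambda t: t != "dead")
assert rows == ["a", "b", "c"]  # page 0 skipped the checks; page 1 filtered "dead"
```

If the map is lost or corrupted, nothing is wrong with the data: treat every page as not-all-visible and rebuild the map as you go, exactly as a pack index can be regenerated from the packfile.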
## What I’d Build Next
The intersection of git and databases is fascinating. A database that uses git’s object model would have:
- Content-addressed pages (corruption detection, deduplication)
- Immutable page snapshots (no WAL needed, recovery is trivial)
- Merkle tree page directory (instant diff between any two snapshots)
- Delta compression (like git’s pack files, for space efficiency)
This is basically what Dolt does — a SQL database built on a git-like storage engine. It’s not hypothetical; it works. But building both systems from scratch makes you appreciate why it works.
The fundamental insight: version control and transaction management are the same problem at different scales. Git versions files across time. MVCC versions rows across transactions. Both need snapshots, diff detection, merge conflict resolution, and garbage collection. The data structures and algorithms are transferable.
132 tests in the git implementation. 5,550 in the database. Same patterns, different scales. Same lessons, different languages.