Building a SQL Parser from Scratch in JavaScript
HenryDB’s SQL parser handles 250+ SQL features in about 1,500 lines of JavaScript. No parser generators, no external dependencies. Here’s how it works and what I learned building it.
Architecture: Three Stages
SQL string → Tokenizer → Token stream → Parser → AST → Executor → Results
Stage 1: Tokenizer
The tokenizer converts raw SQL into tokens. It’s surprisingly simple — just a while loop:
function tokenize(sql) {
const tokens = [];
let i = 0;
while (i < sql.length) {
// Skip whitespace
if (/\s/.test(sql[i])) { i++; continue; }
// Numbers
if (/\d/.test(sql[i])) {
let num = '';
while (i < sql.length && /[\d.]/.test(sql[i])) num += sql[i++];
tokens.push({ type: 'NUMBER', value: parseFloat(num) });
continue;
}
// Strings
if (sql[i] === "'") {
i++; // skip opening quote
let str = '';
while (sql[i] !== "'") str += sql[i++];
i++; // skip closing quote
tokens.push({ type: 'STRING', value: str });
continue;
}
// Keywords and identifiers
if (/[a-zA-Z_]/.test(sql[i])) {
let ident = '';
while (i < sql.length && /[a-zA-Z0-9_.]/.test(sql[i])) ident += sql[i++];
const upper = ident.toUpperCase();
if (KEYWORDS.has(upper)) {
tokens.push({ type: 'KEYWORD', value: upper });
} else {
tokens.push({ type: 'IDENT', value: ident });
}
continue;
}
// Operators, parens, etc.
tokens.push({ type: 'SYMBOL', value: sql[i++] });
}
return tokens;
}
The tricky parts:
- Qualified identifiers:
table.columnbecomes one token (the.is included) - Qualified star:
table.*needs special detection at tokenize time - String escaping: Single quotes inside strings use
''(double-single-quote) - Keywords vs identifiers:
SELECTis a keyword,select_countis an identifier
Stage 2: Parser (Recursive Descent)
The parser is a textbook recursive descent parser. Each SQL clause gets its own function:
function parseSelectStatement() {
expect('SELECT');
const distinct = match('DISTINCT');
const columns = parseSelectList();
let from = null;
if (isKeyword('FROM')) {
advance();
from = parseFrom();
}
let where = null;
if (isKeyword('WHERE')) {
advance();
where = parseExpression();
}
// ... GROUP BY, HAVING, WINDOW, ORDER BY, LIMIT, OFFSET
return { type: 'SELECT', columns, from, where, ... };
}
The hardest parts to get right:
1. Expression parsing with precedence. 2 + 3 * 4 must evaluate to 14, not 20. I use Pratt parsing (operator precedence climbing):
function parseExpression(minPrec = 0) {
let left = parsePrimary();
while (peek() is an operator with precedence >= minPrec) {
const op = advance();
const right = parseExpression(precedenceOf(op) + 1);
left = { type: 'binary', op, left, right };
}
return left;
}
2. Ambiguous keywords. AS can be an alias or part of CREATE TABLE AS SELECT. IN can be WHERE x IN (1,2) or WHERE x IN (SELECT ...). Context determines meaning.
3. SELECT column types. A column in the SELECT list could be:
- A bare column name:
name - A table-qualified column:
users.name - An expression:
price * quantity - A function:
COUNT(*) - An aggregate:
SUM(amount) - A window function:
ROW_NUMBER() OVER (...) - A subquery:
(SELECT MAX(id) FROM t) - A CASE expression:
CASE WHEN ... THEN ... END
All of these need to be detected and parsed differently.
Stage 3: AST → Execution
The AST is a plain JavaScript object tree. The executor walks it recursively:
execute(ast) {
switch (ast.type) {
case 'SELECT': return this._select(ast);
case 'INSERT': return this._insert(ast);
case 'CREATE_TABLE': return this._createTable(ast);
// ...
}
}
Lessons Learned
1. Start with the easy cases. SELECT * FROM t is much simpler than SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d) FROM t GROUP BY a HAVING COUNT(*) > 1. Get the simple case working first.
2. The parser is 30% of the work, the executor is 70%. Parsing GROUP BY is trivial. Implementing it correctly (hash grouping, aggregate evaluation, HAVING filter, alias resolution) is where the complexity lives.
3. Test early, test weird. The bugs I found weren’t in obvious queries. They were in edge cases:
SELECT 42 as b FROM table— the42was parsed as a column referenceSELECT a+1, b+1 FROM t— both unnamed expressions got the keyexpr, second overwrote firstGROUP BY classification— aliases weren’t resolved to their CASE expressions
4. SQL is surprisingly regular. Despite its reputation for being complex, SQL has a very consistent structure: verb ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT. Once you nail this skeleton, adding features is incremental.
5. The tokenizer matters more than you think. Bugs in tokenization cascade into impossible-to-debug parser errors. Getting table.* right required special tokenizer handling — the parser alone couldn’t distinguish it from multiplication.
Stats
HenryDB’s parser:
- ~1,500 lines of JavaScript
- ~150 SQL keywords recognized
- Handles: SELECT, INSERT, UPDATE, DELETE, CREATE TABLE/INDEX/VIEW, ALTER TABLE, DROP, WITH (RECURSIVE), EXPLAIN, SHOW, TRUNCATE, UPSERT
- Passes 250/250 SQL compliance checks
- Generates AST that’s directly executable
No parser generator needed. Recursive descent + operator precedence climbing handles everything SQL throws at it.
HenryDB is a SQL database written from scratch in JavaScript. Source on GitHub.