Real World OCaml

2nd Edition (Oct 2022)

Parsing with OCamllex and Menhir

This chapter includes contributions from Jason Hickey.

Many programming tasks start with the interpretation of some form of structured textual data. Parsing is the process of converting such data into data structures that are easy to program against. For simple formats, it’s often enough to parse the data in an ad hoc way, say, by breaking up the data into lines, and then using regular expressions for breaking those lines down into their component pieces.
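
For instance, a format with one key/value pair per line can be handled with nothing more than the standard library's string functions. The little function below is purely illustrative and not part of this chapter's JSON parser:

let parse_line line =
  (* Split "key = value" on '=' and trim the surrounding whitespace. *)
  match String.split_on_char '=' line with
  | [ key; value ] -> Some (String.trim key, String.trim value)
  | _ -> None

let () =
  match parse_line "title = Cities" with
  | Some (k, v) -> Printf.printf "%s -> %s\n" k v
  | None -> print_endline "malformed line"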

But this simplistic approach tends to fall down when parsing more complicated data, particularly data with the kind of recursive structure you find in full-blown programming languages or flexible data formats like JSON and XML. Parsing such formats accurately and efficiently while providing useful error messages is a complex task.

Often, you can find an existing parsing library that handles these issues for you. But there are tools to simplify the task when you do need to write a parser, in the form of parser generators. A parser generator creates a parser from a specification of the data format that you want to parse.

Parser generators have a long history, including tools like lex and yacc that date back to the early 1970s. OCaml has its own alternatives, including ocamllex, which replaces lex, and ocamlyacc and menhir, which replace yacc. We’ll explore these tools in the course of walking through the implementation of a parser for the JSON serialization format that we discussed in Chapter 18, Handling Json Data.

Parsing is a broad and often intricate topic, and our purpose here is not to teach all of the theoretical issues, but to provide a pragmatic introduction to how to build a parser in OCaml.

Menhir Versus ocamlyacc

Menhir is an alternative parser generator that is generally superior to the venerable ocamlyacc, which dates back quite a few years. Menhir is mostly compatible with ocamlyacc grammars, and so you can usually just switch to Menhir and expect older code to work (with some minor differences described in the Menhir manual).

The biggest advantage of Menhir is that its error messages are generally more human-comprehensible, and the parsers that it generates are fully reentrant and can be parameterized in OCaml modules more easily. We recommend that any new code you develop should use Menhir instead of ocamlyacc.

Menhir isn’t distributed directly with OCaml but is available through OPAM by running opam install menhir.

Lexing and Parsing

Parsing is traditionally broken down into two parts: lexical analysis, which is a kind of simplified parsing phase that converts a stream of characters into a stream of logical tokens; and full-on parsing, which involves converting a stream of tokens into the final representation, which is often in the form of a tree-like data structure called an abstract syntax tree, or AST.   

It’s confusing that the term parsing is applied to both the overall process of converting textual data to structured data, and also more specifically to the second phase of converting a stream of tokens to an AST; so from here on out, we’ll use the term parsing to refer only to this second phase.

Let’s consider lexing and parsing in the context of the JSON format. Here’s a snippet of text that represents a JSON object containing a string labeled title and an array containing two objects, each with a name and array of zip codes:

{
  "title": "Cities",
  "cities": [
    { "name": "Chicago",  "zips": [60601] },
    { "name": "New York", "zips": [10004] }
  ]
}

At a syntactic level, we can think of a JSON file as a series of simple logical units, like curly braces, square brackets, commas, colons, identifiers, numbers, and quoted strings. Thus, we could represent our JSON text as a sequence of tokens of the following type:

type token =
  | NULL
  | TRUE
  | FALSE
  | STRING of string
  | INT of int
  | FLOAT of float
  | LEFT_BRACK
  | RIGHT_BRACK
  | LEFT_BRACE
  | RIGHT_BRACE
  | COMMA
  | COLON
  | EOF

Note that this representation loses some information about the original text. For example, whitespace is not represented. It’s common, and indeed useful, for the token stream to forget some details of the original text that are not required for understanding its meaning.

If we converted the preceding example into a list of these tokens, it would look something like this:

[ LEFT_BRACE; STRING("title"); COLON; STRING("Cities");
  COMMA; STRING("cities"); ... ]

This kind of representation is easier to work with than the original text, since it gets rid of some unimportant syntactic details and adds useful structure. But it’s still a good deal more low-level than the simple AST we used for representing JSON data in Chapter 18, Handling Json Data:

type value =
  [ `Assoc of (string * value) list
  | `Bool of bool
  | `Float of float
  | `Int of int
  | `List of value list
  | `Null
  | `String of string ]

This representation is much richer than our token stream, capturing the fact that JSON values can be nested inside each other and that JSON has a variety of value types, including numbers, strings, arrays, and objects. The parser we’ll write will convert a token stream into a value of this AST type, as shown below for our earlier JSON example:

`Assoc
  ["title", `String "Cities";
   "cities", `List
     [`Assoc ["name", `String "Chicago"; "zips", `List [`Int 60601]];
      `Assoc ["name", `String "New York"; "zips", `List [`Int 10004]]]]

Defining a Parser

A parser-specification file has suffix .mly and contains two sections that are broken up by separator lines consisting of the characters %% on a line by themselves. The first section of the file is for declarations, including token and type specifications, precedence directives, and other output directives; and the second section is for specifying the grammar of the language to be parsed.   

We’ll start by declaring the list of tokens. A token is declared using the syntax %token <type> uid, where the <type> is optional and uid is a capitalized identifier. For JSON, we need tokens for numbers, strings, identifiers, and punctuation:

%token <int> INT
%token <float> FLOAT
%token <string> ID
%token <string> STRING
%token TRUE
%token FALSE
%token NULL
%token LEFT_BRACE
%token RIGHT_BRACE
%token LEFT_BRACK
%token RIGHT_BRACK
%token COLON
%token COMMA
%token EOF

The <type> specifications mean that a token carries a value. The INT token carries an integer value with it, FLOAT has a float value, and STRING carries a string value. The remaining tokens, such as TRUE, FALSE, or the punctuation, aren’t associated with any value, and so we can omit the <type> specification.

Describing the Grammar

The next thing we need to do is to specify the grammar of a JSON expression. menhir, like many parser generators, expresses grammars as context-free grammars. (More precisely, menhir supports LR(1) grammars, but we will ignore that technical distinction here.) You can think of a context-free grammar as a set of abstract names, called non-terminal symbols, along with a collection of rules for transforming a nonterminal symbol into a sequence of tokens and nonterminal symbols. A sequence of tokens is parsable by a grammar if, starting at a distinguished start symbol, you can apply the grammar’s rules in a series of transformations that produces the token sequence in question.
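
For example, using the succinct version of the JSON grammar that appears later in this chapter, the token sequence for {"title": "Cities"} is parsable because we can start at the start symbol prog and expand nonterminals step by step until only tokens remain:

prog
  → value
  → LEFT_BRACE obj_fields RIGHT_BRACE
  → LEFT_BRACE obj_field RIGHT_BRACE
  → LEFT_BRACE STRING COLON value RIGHT_BRACE
  → LEFT_BRACE STRING COLON STRING RIGHT_BRACE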

We’ll start describing the JSON grammar by declaring the start symbol to be the non-terminal symbol prog, and by declaring that when parsed, a prog value should be converted into an OCaml value of type Json.value option. We then end the declaration section of the parser with a %%:

%start <Json.value option> prog
%%

Once that’s in place, we can start specifying the productions. In menhir, productions are organized into rules, where each rule lists all the possible productions for a given nonterminal symbol. Here, for example, is the rule for prog:

prog:
  | EOF       { None }
  | v = value { Some v }
  ;

The syntax for this is reminiscent of an OCaml match expression. The pipes separate the individual productions, and the curly braces contain a semantic action: OCaml code that generates the OCaml value corresponding to the production in question. Semantic actions are arbitrary OCaml expressions that are evaluated during parsing to produce values that are attached to the non-terminal in the rule.   

We have two cases for prog: either there’s an EOF, which means the text is empty, so there’s no JSON value to read and we return the OCaml value None; or we have an instance of the value nonterminal, which corresponds to a well-formed JSON value, and we wrap the corresponding Json.value in a Some tag. Note that in the value case, we wrote v = value to bind the OCaml value corresponding to value to the variable v, which we can then use within the curly braces for that production.

Now let’s consider a more complex example, the rule for the value symbol:

value:
  | LEFT_BRACE; obj = object_fields; RIGHT_BRACE
    { `Assoc obj }
  | LEFT_BRACK; vl = array_values; RIGHT_BRACK
    { `List vl }
  | s = STRING
    { `String s }
  | i = INT
    { `Int i }
  | x = FLOAT
    { `Float x }
  | TRUE
    { `Bool true }
  | FALSE
    { `Bool false }
  | NULL
    { `Null }
  ;

According to these rules, a JSON value is either:  

  • An object bracketed by curly braces

  • An array bracketed by square brackets

  • A string, integer, float, bool, or null value

In each of the productions, the OCaml code in curly braces shows what to transform the object in question to. Note that we still have two nonterminals whose definitions we depend on here but have not yet defined: object_fields and array_values. We’ll look at how these are parsed next.

Parsing Sequences

The rule for object_fields follows, and is really just a thin wrapper that reverses the list returned by the following rule for rev_object_fields. Note that the first production in rev_object_fields has an empty left-hand side, because what we’re matching on in this case is an empty sequence of tokens. The comment (* empty *) is used to make this clear:   

object_fields: obj = rev_object_fields { List.rev obj };

rev_object_fields:
  | (* empty *) { [] }
  | obj = rev_object_fields; COMMA; k = STRING; COLON; v = value
    { (k, v) :: obj }
  ;

The rules are structured as they are because menhir, like other LR parser generators, handles left-recursive rules more efficiently: the pushdown automaton it constructs needs only constant stack space for left-recursive definitions. The following right-recursive rule accepts the same input, but during parsing it requires stack space linear in the number of object fields it reads:

(* Inefficient right-recursive rule *)
object_fields:
  | (* empty *) { [] }
  | k = STRING; COLON; v = value; COMMA; obj = object_fields
    { (k, v) :: obj }

Alternatively, we could keep the left-recursive definition and simply construct the returned value in left-to-right order. This is even less efficient, since the complexity of building the list incrementally in this way is quadratic in the length of the list:

(* Quadratic left-recursive rule *)
object_fields:
  | (* empty *) { [] }
  | obj = object_fields; COMMA; k = STRING; COLON; v = value
    { obj @ [k, v] }
  ;

Assembling lists like this is a pretty common requirement in most realistic grammars, and the preceding rules (while useful for illustrating how parsing works) are rather verbose. Menhir features an extended standard library of built-in rules to simplify this handling. These rules are detailed in the Menhir manual and include optional values, pairs of values with optional separators, and lists of elements (also with optional separators).  
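
For example, the delimited(opening, X, closing) rule from that standard library matches X between two delimiters. Purely as an illustration (this isn’t the version used in the rest of the chapter), the object and array cases of value could equally be written with it:

value:
  | obj = delimited(LEFT_BRACE, obj_fields, RIGHT_BRACE)   { `Assoc obj }
  | vl  = delimited(LEFT_BRACK, list_fields, RIGHT_BRACK)  { `List vl   }
  (* ...remaining cases unchanged... *)
  ;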

A version of the JSON grammar using these more succinct Menhir rules follows. Notice the use of separated_list to parse both JSON objects and lists with one rule:

%token <int> INT
%token <float> FLOAT
%token <string> STRING
%token TRUE
%token FALSE
%token NULL
%token LEFT_BRACE
%token RIGHT_BRACE
%token LEFT_BRACK
%token RIGHT_BRACK
%token COLON
%token COMMA
%token EOF
%start <Json.value option> prog
%%

prog:
  | v = value { Some v }
  | EOF       { None   } ;

value:
  | LEFT_BRACE; obj = obj_fields; RIGHT_BRACE { `Assoc obj  }
  | LEFT_BRACK; vl = list_fields; RIGHT_BRACK { `List vl    }
  | s = STRING                                { `String s   }
  | i = INT                                   { `Int i      }
  | x = FLOAT                                 { `Float x    }
  | TRUE                                      { `Bool true  }
  | FALSE                                     { `Bool false }
  | NULL                                      { `Null       } ;

obj_fields:
    obj = separated_list(COMMA, obj_field)    { obj } ;

obj_field:
    k = STRING; COLON; v = value              { (k, v) } ;

list_fields:
    vl = separated_list(COMMA, value)         { vl } ;

We can hook in menhir by adding a (menhir) stanza to our dune file, which tells the build system to switch to using menhir instead of ocamlyacc to handle files with the .mly suffix:   

(menhir
 (modules parser))

(ocamllex lexer)

(library
 (name json_parser)
 (modules parser lexer json)
 (libraries core))

Defining a Lexer

Now we can define a lexer, using ocamllex, to convert our input text into a stream of tokens. The specification of the lexer is placed in a file with an .mll suffix.      

OCaml Prelude

Let’s walk through the definition of a lexer section by section. The first section is an optional chunk of OCaml code that is bounded by a pair of curly braces:  

{
open Lexing
open Parser

exception SyntaxError of string
}

This code defines utility functions used by later snippets of OCaml code and sets up the environment by opening useful modules and defining an exception, SyntaxError. Any OCaml functions you define here will be subsequently available in the remainder of the lexer definition.
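
One utility we’ll need later is next_line, which the read rule below uses to advance the lexer’s line number. It isn’t shown in the prelude above, but a typical definition, placed inside the same curly braces, just updates the lexbuf’s current position record (the standard library’s Lexing.new_line does the equivalent):

let next_line lexbuf =
  (* Record that a newline was consumed: bump the line number and mark the
     start of the new line for use in error positions. *)
  let pos = lexbuf.lex_curr_p in
  lexbuf.lex_curr_p <-
    { pos with pos_bol = lexbuf.lex_curr_pos;
               pos_lnum = pos.pos_lnum + 1
    }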

Regular Expressions

The next section of the lexing file is a collection of named regular expressions. These look syntactically like ordinary OCaml let bindings, but really this is a specialized syntax for declaring regular expressions. Here’s an example:   

let int = '-'? ['0'-'9'] ['0'-'9']*

The syntax here is something of a hybrid between OCaml syntax and traditional regular expression syntax. The int regular expression specifies an optional leading -, followed by a digit from 0 to 9, followed by some number of digits from 0 to 9. The question mark is used to indicate an optional component of a regular expression; the square brackets are used to specify ranges; and the * operator is used to indicate a (possibly empty) repetition.

Floating-point numbers are specified similarly, but we deal with decimal points and exponents. We make the expression easier to read by building up a sequence of named regular expressions, rather than creating one big and impenetrable expression:

let digit = ['0'-'9']
let frac = '.' digit*
let exp = ['e' 'E'] ['-' '+']? digit+
let float = digit* frac? exp?

Finally, we define whitespace, newlines, and identifiers:

let white = [' ' '\t']+
let newline = '\r' | '\n' | "\r\n"
let id = ['a'-'z' 'A'-'Z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_']*

The newline definition introduces the | operator, which lets one of several alternative regular expressions match (in this case, the various line-ending conventions: CR, LF, or CRLF).

Lexing Rules

The lexing rules are essentially functions that consume the data, producing OCaml expressions that evaluate to tokens. These OCaml expressions can be quite complicated, using side effects and invoking other rules as part of the body of the rule. Let’s look at the read rule for parsing a JSON expression:  

rule read =
  parse
  | white    { read lexbuf }
  | newline  { next_line lexbuf; read lexbuf }
  | int      { INT (int_of_string (Lexing.lexeme lexbuf)) }
  | float    { FLOAT (float_of_string (Lexing.lexeme lexbuf)) }
  | "true"   { TRUE }
  | "false"  { FALSE }
  | "null"   { NULL }
  | '"'      { read_string (Buffer.create 17) lexbuf }
  | '{'      { LEFT_BRACE }
  | '}'      { RIGHT_BRACE }
  | '['      { LEFT_BRACK }
  | ']'      { RIGHT_BRACK }
  | ':'      { COLON }
  | ','      { COMMA }
  | _ { raise (SyntaxError ("Unexpected char: " ^ Lexing.lexeme lexbuf)) }
  | eof      { EOF }

The rules are structured very similarly to pattern matches, except that the variants are replaced by regular expressions on the left-hand side. The clause on the right-hand side is the OCaml expression that is evaluated to produce the rule’s return value when that pattern matches. The OCaml code for the rules has a parameter called lexbuf that defines the input, including the position in the input file, as well as the text that was matched by the regular expression.

The first action, white { read lexbuf }, calls the lexer recursively. That is, it skips the input whitespace and returns the following token. The action newline { next_line lexbuf; read lexbuf } is similar, but we also use it to advance the line number for the lexer using the next_line utility function that we defined at the top of the file. Let’s skip to the third action:

| int { INT (int_of_string (Lexing.lexeme lexbuf)) }

This action specifies that when the input matches the int regular expression, then the lexer should return the expression INT (int_of_string (Lexing.lexeme lexbuf)). The expression Lexing.lexeme lexbuf returns the complete string matched by the regular expression. In this case, the string represents a number, so we use the int_of_string function to convert it to a number.

There are actions for each different kind of token. The string expressions like "true" { TRUE } are used for keywords, and the special characters have actions, too, like '{' { LEFT_BRACE }.

Some of these patterns overlap. For example, the regular expression "true" is also matched by the id pattern. ocamllex uses the following disambiguation rules when a prefix of the input is matched by more than one pattern:

  • The longest match always wins. For example, given the input trueX: 167, the text trueX matches the regular expression "true" for four characters, and it matches id for five characters. The longer match wins, and the return value is ID "trueX".

  • If all matches have the same length, then the first action wins. If the input were true: 167, then both "true" and id match the first four characters; "true" is first, so the return value is TRUE.

Unused Lexing Values

In our parser, we have not used all the token regexps that we defined in the lexer. For instance, id is unused, since we do not parse unquoted strings for object identifiers (something that is allowed by JavaScript, but not by JSON). If we included a token pattern match for this in the lexer, then we would have to adjust the parser accordingly to add a %token <string> ID declaration. This would in turn trigger an “unused” warning, since no production in the grammar ever uses the ID token:

File "parser.mly", line 4, characters 16-18:
Warning: the token ID is unused.

It’s completely fine to define unused regexps as we’ve done, and to hook them into parsers as required. For example, we might use ID if we add an extension to our parser for supporting unquoted string identifiers as a non-standard JSON extension.
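
If we did eventually wire ID into the grammar, the change would be confined to the key position of object fields. A hypothetical sketch (not part of the parser built in this chapter) might look like this:

obj_field:
  | k = STRING; COLON; v = value              { (k, v) }
  | k = ID;     COLON; v = value              { (k, v) } ;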

Recursive Rules

Unlike many other lexer generators, ocamllex allows the definition of multiple lexers in the same file, and the definitions can be recursive. In this case, we use recursion to match string literals using the following rule definition:   

and read_string buf =
  parse
  | '"'       { STRING (Buffer.contents buf) }
  | '\\' '/'  { Buffer.add_char buf '/'; read_string buf lexbuf }
  | '\\' '\\' { Buffer.add_char buf '\\'; read_string buf lexbuf }
  | '\\' 'b'  { Buffer.add_char buf '\b'; read_string buf lexbuf }
  | '\\' 'f'  { Buffer.add_char buf '\012'; read_string buf lexbuf }
  | '\\' 'n'  { Buffer.add_char buf '\n'; read_string buf lexbuf }
  | '\\' 'r'  { Buffer.add_char buf '\r'; read_string buf lexbuf }
  | '\\' 't'  { Buffer.add_char buf '\t'; read_string buf lexbuf }
  | [^ '"' '\\']+
    { Buffer.add_string buf (Lexing.lexeme lexbuf);
      read_string buf lexbuf
    }
  | _ { raise (SyntaxError ("Illegal string character: " ^ Lexing.lexeme lexbuf)) }
  | eof { raise (SyntaxError ("String is not terminated")) }

This rule takes a buf : Buffer.t as an argument. If we reach the terminating double quote ", then we return the contents of the buffer as a STRING.

The other cases are for handling the string contents. The action [^ '"' '\\']+ { ... } matches normal input that does not contain a double quote or backslash. The actions beginning with a backslash \ define what to do for escape sequences. In each of these cases, the final step includes a recursive call to the lexer.

That covers the lexer. Next, we need to combine the lexer with the parser to bring it all together.       

Handling Unicode

We’ve glossed over an important detail here: parsing Unicode characters to handle the full spectrum of the world’s writing systems. OCaml has several third-party solutions for handling Unicode, with varying degrees of flexibility and complexity:

  • Uutf is a nonblocking streaming Unicode codec for OCaml, available as a standalone library. It is accompanied by the Uunf text normalization and Uucd Unicode character database libraries. There is also a robust parser for JSON available that illustrates the use of Uutf in your own libraries.

  • Camomile supports the full spectrum of Unicode character types, conversion from around 200 encodings, and collation and locale-sensitive case mappings.

  • sedlex is a lexer generator for Unicode that can serve as a Unicode-aware replacement for ocamllex.

All of these libraries are available via opam under their respective names.

Bringing It All Together

For the final part, we need to compose the lexer and parser. As we can see from the type of the prog function in the generated parser.mli, the parsing function expects a lexer of type Lexing.lexbuf -> token and a lexbuf:

val prog : (Lexing.lexbuf -> token) -> Lexing.lexbuf -> Json.value option
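
In the simplest case, gluing the two together is a one-liner: construct a lexbuf from the input and pass Lexer.read along with it to Parser.prog. The helper below is just an illustration with no error handling, and the name parse_string is ours rather than part of the chapter’s code:

let parse_string s : Json.value option =
  Parser.prog Lexer.read (Lexing.from_string s)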

Before we start with the lexing, let’s first define some functions to handle parsing errors. There are currently two errors: Parser.Error and Lexer.SyntaxError. A simple solution when encountering an error is to print the error and give up:  

open Core
open Lexer
open Lexing

let print_position outx lexbuf =
  let pos = lexbuf.lex_curr_p in
  fprintf outx "%s:%d:%d" pos.pos_fname
    pos.pos_lnum (pos.pos_cnum - pos.pos_bol + 1)

let parse_with_error lexbuf =
  try Parser.prog Lexer.read lexbuf with
  | SyntaxError msg ->
    fprintf stderr "%a: %s\n" print_position lexbuf msg;
    None
  | Parser.Error ->
    fprintf stderr "%a: syntax error\n" print_position lexbuf;
    exit (-1)

The “give up on the first error” approach is easy to implement but isn’t very friendly. In general, error handling can be pretty intricate, and we won’t discuss it here. However, the Menhir parser defines additional mechanisms you can use to try and recover from errors. These are described in detail in its reference manual.  

The standard lexing library Lexing provides a function from_channel to read the input from a channel. The following code shows the overall structure: Lexing.from_channel is used to construct a lexbuf, which is passed along with the lexing function Lexer.read to Parser.prog. Parser.prog returns None when it reaches end of file. We define a function Json.output_value to print a Json.value; its implementation isn’t part of these listings, though a possible sketch follows the main program below:

let rec parse_and_print lexbuf =
  match parse_with_error lexbuf with
  | Some value ->
    printf "%a\n" Json.output_value value;
    parse_and_print lexbuf
  | None -> ()

let loop filename () =
  let inx = In_channel.create filename in
  let lexbuf = Lexing.from_channel inx in
  lexbuf.lex_curr_p <- { lexbuf.lex_curr_p with pos_fname = filename };
  parse_and_print lexbuf;
  In_channel.close inx

let () =
  Command.basic_spec ~summary:"Parse and display JSON"
    Command.Spec.(empty +> anon ("filename" %: string))
    loop
  |> Command_unix.run
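
We mentioned above that Json.output_value isn’t shown in this chapter. A minimal sketch with the right type to be used with printf’s %a specifier could look like the following; the chapter’s real implementation may format its output differently:

open Core

(* Print a Json.value to an output channel in roughly JSON syntax. *)
let rec output_value outc = function
  | `Null       -> fprintf outc "null"
  | `Bool b     -> fprintf outc "%b" b
  | `Int i      -> fprintf outc "%d" i
  | `Float x    -> fprintf outc "%f" x
  | `String s   -> fprintf outc "%S" s
  | `List l     ->
    fprintf outc "[";
    List.iteri l ~f:(fun i v ->
      if i > 0 then fprintf outc ", ";
      output_value outc v);
    fprintf outc "]"
  | `Assoc fields ->
    fprintf outc "{ ";
    List.iteri fields ~f:(fun i (key, v) ->
      if i > 0 then fprintf outc ", ";
      fprintf outc "%S: %a" key output_value v);
    fprintf outc " }"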

Here’s a test input file we can use to test the code we just wrote:

true
false
null
[1, 2, 3., 4.0, .5, 5.5e5, 6.3]
"Hello World"
{ "field1": "Hello",
  "field2": 17e13,
  "field3": [1, 2, 3],
  "field4": { "fieldA": 1, "fieldB": "Hello" }
}

Now build and run the example using this file, and you can see the full parser in action:

dune exec ./test.exe test1.json
>true
>false
>null
>[1, 2, 3.000000, 4.000000, 0.500000, 550000.000000, 6.300000]
>"Hello World"
>{ "field1": "Hello",
>  "field2": 170000000000000.000000,
>  "field3": [1, 2, 3],
>  "field4": { "fieldA": 1,
>  "fieldB": "Hello" } }

With our simple error handling scheme, errors are fatal and cause the program to terminate with a nonzero exit code:

cat test2.json
>{ "name": "Chicago",
>  "zips": [12345,
>}
>{ "name": "New York",
>  "zips": [10004]
>}
dune exec ./test.exe test2.json
>test2.json:3:2: syntax error
[255]

That wraps up our parsing tutorial. As an aside, notice that the JSON polymorphic variant type that we defined in this chapter is actually structurally compatible with the Yojson representation explained in Chapter 18, Handling Json Data. That means that you can take this parser and use it with the helper functions in Yojson to build more sophisticated applications.
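
As a quick illustration of that compatibility (assuming the yojson library is installed), a value built with our Json.value type can be handed straight to Yojson’s printers:

let () =
  let v : Json.value =
    `Assoc [ "title", `String "Cities"; "zips", `List [ `Int 60601 ] ]
  in
  (* Json.value and Yojson.Basic.t are made of the same polymorphic variants,
     so the coercion below is a type-level no-op. *)
  print_endline (Yojson.Basic.to_string (v :> Yojson.Basic.t))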
