How to write cleaner parsing code

When tasked to extract information from a string, most seasoned developers – especially the ones with some kind of Linux background – will resort to regular expressions. While regular expressions are incredibly powerful and very well suited for most parsing jobs, they don’t scale very well. With increasing complexity they tend to become very cryptic, if not unreadable and therefore eventually unmaintainable.

Another alternative is to use parser combinators that combine small atomic parsers to build a complex rule set.

Let’s see how we can use parser combinators in an (admittedly simple) context. One of our customers runs a fascinating web service: Using a easy-to-understand URL scheme, you can remotely control a rendering engine that can effectively combine hundreds of different car parts to render a photo-realistic image of a specific car configuration. One task in a recent project was to extract car configuration data from a web service call.

A real world example

Our customer’s API is confidential, so I had to alter a few specifications and sadly can’t give any example images.

As mentioned before, the URL scheme is straightforward, but it seems to be more human-friendly than machine-friendly, but more about that later. It consists of the following parts:

  1. The country version’s ISO code
  2. Car metadata (model is mandatory, body and grade are optional)
  3. Optional codes for
    1. option codes
    2. equipment packs
    3. accessories
  4. Image metadata like resolution, background colour and car angle

As you can see, there is a bit of inconsistency that might cause some inconvenience while parsing: the country code doesn’t have a qualifier, the car model, body and grade info are delimited by slashes, the other fields are delimited by commas. Also there’s some information encoded in the filename, but we can ignore that as we’re only parsing the car configuration, not the image metadata.

The right tool for our task

The basic idea of any parser combinator is that it takes an input, reads (“consumes”) it until it is either done or fails, in both cases returning the result and the remaining unparsed string, which might be fed into subsequent parsers. In this example we will be using the Parsec library to manage the grunt work for us. It seems well-suited for this task, it’s quite readable and as straightforward as something that calls itself “industrial strength, monadic parser combinator” can be and there are many clones and implementations in various programming languages.

Parsec is written in the pure functional programming language Haskell, but to understand the examples in this post, you won’t need to understand Haskell. I chose Haskell for the code examples because the parsers contain barely any language-specific syntax, all the handling is nicely kept away from us, making the code almost look like pseudo code.

Writing the parser

As mentioned earlier, a parser combinator is built up from simple building blocks combined to complex parsers. There are quite a lot predefined parsers for common tasks that we can re-use and combine. It’s fascinating that all parsers are independent and can be executed on their own, so everything is easily maintainable and testable.

Parsing characters and letters

We’re dealing with a URL, so even without knowing the specs, we can guess that we will encounter “things delimited by slashes”. According to the web service specification, these “things” can only consist of characters and numbers. We’ll start with a parser that reads these symbols between the slashes and call it value.

value = do
    many1 alphaNum

many1 and alphaNum are parsers already defined in the Parsec library. When run, our combined parser expects one or more (many1) alphanumeric symbols, i.e. letters or numbers (alphaNum). If the input matches these characters it will succeed, if it encounters any other symbol, it will fail. The result of the last line in our do-block is returned automatically.

Using the ‘or’ operator

I lied when I said that values can consist only of alphanumeric characters. Actually the web service specifications also allow the use of “+” and “-“, so we’ll need to add these to our parser. A nice way to achieve this is to use the <|>-operator that basically just means “or”:

value = do
    many1 (alphaNum <|> char '+' <|> char '-')

Another option would be to use ... <|> oneOf "+-".

With this out of the way we have everything we need to combine this parser with another parser that only expects a slash. The resulting parser will be used to read the parts of our URL (you remember: things delimited by slashes). For lack of a better name, let’s call it part.

part = do
    char '/'

This does exactly what you might expect: It reads the character “/” and then a value as defined before.

Failing gracefully

I mentioned earlier that parser functions either return the value or fail. In most imperative programming languages, failing usually means that an error or exception will be thrown. Our parsers work a bit differently: Haskell has a special type called Either that represents values with two possibilities called Left and Right. By convention, Left is used to hold an error value and Right is used to hold a correct value. You’ll see in a minute how that looks.

Let’s confirm in Haskell’s REPL that our parsers work. For this we’ll use the parse helper function that expects the following parameters: the parser to execute, a context name (that we’ll just leave empty) and the input string.

> parse part "" "/foo+bar-42"
Right "foo+bar-42"

> parse part "" "quux"
Left (line 1, column 1):
unexpected "q"
expecting "/"

The results should be self-explanatory: The first parser succeeded, returning a Right with the parsed string. The second call failed, returning a Left with a detailed error message.

Some real work: Parsing strings and optionals

Now let’s start to do some real work and write a parser that reads the first part of the URL including the country code and return that. The protocol can be either “http” or “https”, so we need to take care of both options, for the rest we can reuse our part-parser.

This parser introduces two new predefined parsers: string, which reads any given string (as opposed to char that only reads a single character) and optional that (surprise!) marks a parser as optional.

country = do
    string "http"
    optional (char 's')
    string "://"

In action:

> parse country "" ""
Right "de"

> parse country "" ""
Left (line 1, column 5):
unexpected "n"
expecting "://"

Parsing delimited lists

We’ll skip the vehicle part for now and implement the parsers for the car customization first. Options, accessories and packs are all quite similar: a key followed by a slash and a comma separated list of values. Let’s start with an interesting parser: sepBy splits a string using a separator and returns the results. The output won’t be a string now, but a list of strings instead:

commaDelimited = do
    sepBy value (char ',')

In the REPL this looks like this:

> parse commaDelimited "" "1,2,3,4"
Right ["1","2","3","4"]

The rest is easy:

packs = do
    string "/packs/"

options = do
    string "/options/"

accessories = do
    string "/accessories/"

Looking ahead

With that out-of-the-way we can concentrate on the most interesting part now: The vehicle information that can consist of one or more parts. Let’s start with a quite naive approach:

vehicle = do
    string "/vehicle"
    many1 part

This will obviously not work as it will read beyond the vehicle information and into the customization:

> parse vehicle "" "/vehicle/ford/prefect/packs/pack307"
Right ["ford","prefect","packs","pack307"]

We need a way to tell the parser where to stop. That can be done with the manyTill parser that runs a parser until another parser matches. If we want to check for more than one parser, we can use choice which expects a list of parsers and tries them consecutively. The problem is that parsers works greedily – so any character that matches a parser and is therefor already consumed stays consumed, even if the parser should eventually fail. We can avoid this by using try which makes the parser only consume input when it’s fully executed.

The tricky part is: We don’t want that either. We only want to check, if a parser will succeed, but not actually consume anything. The lookAhead parser does exactly that.

customization = do
    choice [try packs, try accessories, try options]

vehicle = do
    string "/vehicle"
    manyTill part (lookAhead customization)

This will first expect the string “/vehicle” and then run the part parser for as long as possible until our customization parser succeeds, but it will only consume the results of part.

Skipping the rest

That’s it, we don’t need the rest (skipMany), but we should still consume it.

theRest = do
    skipMany anyChar

Returning more than one result

We just created all the building blocks we need to parse the url, but we have to combine them. Up to this point, all parsers did something and returned the last result after running all parsers. What if we want the results of all parsers, i.e. the in-betweens?

Let’s take a step back first and write a parser that reads exactly two alphanumeric characters and then expects the input to end (eof):

twoChars = do
> parse twoChars "" "xy"
Right 'y'

That’s nice, but how do we get to the first x? As each parser returns a result itself, we can just pull it out and store it in a variable using the left arrow <- and return a tuple containing all variables. You don’t have to return a tuple, you can return whatever data type holds your values and matches your domain.

twoChars = do
    x <- alphaNum
    y <- alphaNum
    return (x, y)

Integrating everything

Wrapping it all up, here’s our complete parser.

parser = do
    c  <- country
    v  <- vehicle
    cs <- many1 customization
    return (c, v, cs)

Taking it for a test run:

> parse parser "" ",4711,815/packs/p7/accessories/a,b,c/width/1024/height/768/exterior-45.jpg"

Right ("de",["trabant","5-doors","0815+universal"],[["1","4711","815"],["p7"],["a","b","c"]])



We just wrote a very maintainable parser in less than 40 lines of very readable code. To achieve this, we wrote small and comprehensible parsing functions that we combined to larger parsers.

Nothing we parsed was very impressive or particularly complex and a few loops combined with regular expressions could have achieved the same with maybe even less code. But as soon as a certain complexity is exceeded, the specification is prone to changes, or you want to create cleaner, more testable code that is actually re-usable, parser combinators can be a great alternative.

You can find the complete source (including type annotations) of the parser here: parser.hs.

What’s next?

This barely touches the surface of what Parsec is able to do. If you’re interesting in learning more, an excellent starting point is chapter 16 of Real World Haskell by Bryan O’Sullivan, Don Stewart, and John Goerzen. It goes into much more detail, but expects quite a bit of Haskell knowledge. Also you can Write yourself a Scheme in 48 Hours using Parsec.

If you want to understand parser combinators in general, not necessarily only the Parsec library, check out Graham Hutton’s great book Programming in Haskell. Erik Meijers functional programming MOOC at edX also covers the topic, closely following Hutton’s book.

As mentioned earlier, there are parser combinator libraries for many languages, here’s a short and by no means complete list:
* JParsec for Java
* FParsec for F#
* PyParsing for Python

If you want to know more about the details of the do-notation and return-stuff or just impress your friends with Zygohistomorphic prepromorphisms, Learn You A Haskell is an excellent start. At the time of writing, Chris Allen and Julie Moronuki are about 90% done with their Haskell Book. I’m sure it will be awesome.

The image Legos by qrevolution is licensed under the Creative Commons Attribution 2.0 Generic (CC BY 2.0) licence.

Published by

Christoph von Krüchten

Avid senior software architect. Junior ascetic. Haskell fanatic. Amateur cook. Interested in everything .NET. Powered by an insatiable thirst for coffee.