3 Producing Syntax

Megaparsack always tracks source locations, which is how it is able to include source location information in error messages. This tracking can be leveraged for an entirely separate purpose: creating parsers that produce syntax objects as output. This functionality is extremely useful when creating custom #lang languages.

3.1 Annotating parsers to produce syntax

Megaparsack does not opt to produce syntax objects as the result of every parse because it would make composing parsers extremely tedious. For example, if integer/p produced syntax objects containing integers instead of integers themselves, they would need to be unwrapped before they could be added together or otherwise used as numbers. Instead, megaparsack requires that you opt in to syntax object production by using the syntax/p combinator.
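To make the tradeoff concrete, here is a minimal sketch that parses a sum such as "1+2" in two ways. The names sum/p, integer-stx/p, and sum-from-stx/p are hypothetical and exist only for this illustration; integer-stx/p stands in for what integer/p would look like if it always produced syntax objects.

```racket
#lang racket
(require megaparsack megaparsack/text data/monad data/applicative)

;; With plain integers, the results compose directly as numbers.
(define sum/p
  (do [a <- integer/p]
      (char/p #\+)
      [b <- integer/p]
      (pure (+ a b))))

;; Hypothetical alternative: an integer parser that always yields
;; syntax objects forces an explicit unwrap at every use site.
(define integer-stx/p (syntax/p integer/p))
(define sum-from-stx/p
  (do [a <- integer-stx/p]
      (char/p #\+)
      [b <- integer-stx/p]
      (pure (+ (syntax-e a) (syntax-e b)))))

(define plain-sum (parse-result! (parse-string sum/p "1+2")))
(define unwrapped-sum (parse-result! (parse-string sum-from-stx/p "1+2")))
```

Both parsers produce the same number, but the second one has to call syntax-e on every intermediate result, which is the tedium that the opt-in design avoids.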

The syntax/p combinator is “magic”: it takes any parser and turns it into a parser that produces a syntax object containing accurate source location information. It can do this because it takes advantage of the internal parser state to track information that is otherwise not accessible to parsers. Fortunately, this makes the interface extremely simple to use. Just wrap an ordinary parser with syntax/p and use it as usual:

> (require megaparsack megaparsack/text)
> (parse-string (syntax/p integer/p) "42")

(success #<syntax:string:1:0 42>)

The produced syntax objects automatically keep track of all the relevant source location information, including line, column, position, and span:

> (define stx (parse-result! (parse-string (syntax/p integer/p) "42")))
> (syntax-line stx)

1

> (syntax-column stx)

0

> (syntax-span stx)

2
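The other standard syntax accessors work the same way. The following is a minimal sketch, assuming the default source name that appears in the printed syntax object above; the variable names are only illustrative. It also uses syntax->datum to recover the plain parsed value when the location information is no longer needed.

```racket
#lang racket
(require megaparsack megaparsack/text)

(define stx (parse-result! (parse-string (syntax/p integer/p) "42")))

;; The source name printed as "string" in the syntax object above.
(define source (syntax-source stx))

;; The character offset of the start of the match within the input.
(define position (syntax-position stx))

;; Strip the location information and get the parsed integer back.
(define value (syntax->datum stx))
```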

This syntax tracking is not specific to the built-in parsers, and you do not need to do anything special to use it with your custom parsers. For example, consider a relatively complex parser that parses a list of comma-delimited numbers surrounded by brackets:

> (require data/monad data/applicative)
> (define integer-list/p
    (do (char/p #\[)
        [ints <- (many/p (syntax/p integer/p) #:sep (char/p #\,))]
        (char/p #\])
        (pure ints)))

We’ve annotated the integer/p parser with syntax/p once again so we can get location tracking for each individual list element, but we’ll also annotate the whole thing with syntax/p so we can track information about the entire list as well:

> (define integer-list-stx
    (parse-result! (parse-string (syntax/p integer-list/p) "[1,2,3,5,8,13]")))
> integer-list-stx

#<syntax:string:1:0 (1 2 3 5 8 13)>

> (syntax-span integer-list-stx)

14

As expected, the top-level syntax object spans the entire input, including the brackets. We can also get information about the individual elements, since they are syntax objects as well:

> (syntax->list integer-list-stx)

'(#<syntax:string:1:1 1>
  #<syntax:string:1:3 2>
  #<syntax:string:1:5 3>
  #<syntax:string:1:7 5>
  #<syntax:string:1:9 8>
  #<syntax:string:1:11 13>)

This makes writing a reader for a #lang relatively straightforward because source location information is already encoded into a set of syntax objects which can be used as the source of a Racket module.
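As a rough sketch of what such a reader's entry point might look like, the function below reads a whole port and parses it with the list parser from this section. The name read-integer-list-syntax and the file name "example.rkt" are hypothetical, and the sketch assumes that parse-string accepts an optional third argument naming the source. A real #lang reader would additionally wrap the result in a module form.

```racket
#lang racket
(require megaparsack megaparsack/text data/monad data/applicative)

(define integer-list/p
  (do (char/p #\[)
      [ints <- (many/p (syntax/p integer/p) #:sep (char/p #\,))]
      (char/p #\])
      (pure ints)))

;; Hypothetical reader entry point: takes a source name and an input
;; port, like a read-syntax function, and returns a syntax object.
;; Assumption: the third argument to parse-string sets the source name.
(define (read-integer-list-syntax src in)
  (parse-result!
   (parse-string (syntax/p integer-list/p) (port->string in) src)))

(define example-stx
  (read-integer-list-syntax "example.rkt" (open-input-string "[1,2,3]")))
```

Because parse-result! raises an exception on failure, a malformed input surfaces as an error carrying the source location rather than as a silently wrong syntax object.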

3.2 Parsing tokens from parser-tools/lex

While syntax/p can be used with any megaparsack parser, it is sometimes useful to perform a lexing phase before parsing, so that tasks like tokenization and discarding whitespace happen in a separate pass. Currently, megaparsack does not include tools of its own specifically for lexing (though it would be perfectly possible to use the output of a separate simple parser as the input to another parser), but it does provide a function to interoperate with parser-tools/lex, another Racket library that provides utilities designed specifically for lexing.

When using parser-tools/lex, make sure to use the lexer-src-pos form, which enables the lexer’s own source location tracking. This configures the lexer to produce position-token values as output, which can be fed to parse-tokens from megaparsack/parser-tools/lex to parse with any megaparsack parser.

Parsers that operate on strings, like char/p and integer/p, will not work with tokens from parser-tools/lex because tokens can contain arbitrary data. Instead, use the token/p function to create parsers that handle particular tokens.
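To see token/p in isolation, here is a minimal sketch that skips the lexer entirely and builds a single position-token by hand. The token group name demo and the position values are made up for the illustration.

```racket
#lang racket
(require megaparsack megaparsack/parser-tools/lex parser-tools/lex)

;; Hypothetical token group with one value-carrying token.
(define-tokens demo [NUMBER])

;; A hand-built token stream: one NUMBER token holding 7, with
;; made-up start and end positions (offset, line, column).
(define tokens
  (list (position-token (token-NUMBER 7)
                        (position 1 1 0)
                        (position 2 1 1))))

;; token/p matches on the token's name and yields the token's value.
(define parsed (parse-result! (parse-tokens (token/p 'NUMBER) tokens)))
```

The parser never inspects characters here. It only checks the token name, which is why string-oriented parsers such as char/p have nothing to match against.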

Here is a very simple lexer that produces tokens for identifiers, numbers, parentheses, and commas:

> (require parser-tools/lex (prefix-in : parser-tools/lex-sre) megaparsack/parser-tools/lex)
> (define-tokens simple [IDENTIFIER NUMBER])
> (define-empty-tokens simple* [OPEN-PAREN CLOSE-PAREN COMMA])
> (define simple-lexer
    (lexer-src-pos
     [#\( (token-OPEN-PAREN)]
     [#\) (token-CLOSE-PAREN)]
     [#\, (token-COMMA)]
     [(:+ (:or (:/ #\a #\z) (:/ #\A #\Z)))
      (token-IDENTIFIER (string->symbol lexeme))]
     [(:+ (:/ #\0 #\9))
      (token-NUMBER (string->number lexeme))]
     [(:or whitespace blank iso-control) (void)]
     [(eof) eof]))

We can write a simple helper function to lex a string into a list of tokens, making sure to call port-count-lines! to enable source location tracking:

> (define (lex-simple str)
    (define in (open-input-string str))
    (port-count-lines! in)
    (let loop ([v (simple-lexer in)])
      (cond [(void? (position-token-token v)) (loop (simple-lexer in))]
            [(eof-object? (position-token-token v)) '()]
            [else (cons v (loop (simple-lexer in)))])))
> (lex-simple "f(1, g(3, 4))")

(list
 (position-token (token 'IDENTIFIER 'f) (position 1 1 0) (position 2 1 1))
 (position-token 'OPEN-PAREN (position 2 1 1) (position 3 1 2))
 (position-token (token 'NUMBER 1) (position 3 1 2) (position 4 1 3))
 (position-token 'COMMA (position 4 1 3) (position 5 1 4))
 (position-token (token 'IDENTIFIER 'g) (position 6 1 5) (position 7 1 6))
 (position-token 'OPEN-PAREN (position 7 1 6) (position 8 1 7))
 (position-token (token 'NUMBER 3) (position 8 1 7) (position 9 1 8))
 (position-token 'COMMA (position 9 1 8) (position 10 1 9))
 (position-token (token 'NUMBER 4) (position 11 1 10) (position 12 1 11))
 (position-token 'CLOSE-PAREN (position 12 1 11) (position 13 1 12))
 (position-token 'CLOSE-PAREN (position 13 1 12) (position 14 1 13)))

Next, we can write a trivial parser to actually parse these tokens. Since we’ve written a lexer, most of the heavy lifting is already done, and we can just focus on assigning semantics:

; some wrappers around tokens that use syntax/p
> (define number/p (syntax/p (token/p 'NUMBER)))
> (define identifier/p (syntax/p (token/p 'IDENTIFIER)))
; a simple function invocation
> (define funcall/p
    (syntax/p
     (do [func <- identifier/p]
         (token/p 'OPEN-PAREN)
         [args <- (many/p expression/p #:sep (token/p 'COMMA))]
         (token/p 'CLOSE-PAREN)
         (pure (list* func args)))))
; an expression can be a number or a function invocation
> (define expression/p
    (or/p number/p
          funcall/p))

Now, with our simple parser in place, we can actually parse arbitrary C-style function calls into S-expressions:

> (define expr-stx
    (parse-result! (parse-tokens expression/p (lex-simple "f(1, g(3, 4))"))))
> expr-stx

#<syntax:tokens:1:0 (f 1 (g 3 4))>

As expected, the source locations for each datum will automatically be assigned to the resulting syntax object due to the use of syntax/p on the base datums and around funcall/p:

> (syntax->list expr-stx)

'(#<syntax:tokens:1:0 f> #<syntax:tokens:1:2 1> #<syntax:tokens:1:5 (g 3 4)>)

In just a couple dozen lines of code, we’ve managed to implement a fairly robust parser that produces syntax objects ready to be handed off to Racket as the result of parsing a module body, which can be compiled into working Racket code.