amb-parser: Parser generator for ambiguous grammars

8.12

amb-parser: Parser generator for ambiguous grammars🔗ℹ

Ruslan Popov

1 Introduction🔗ℹ

This is a very simple parser generator that mimics brag (and actually its built on top of it). It is designed for parsing English text.

Its main features are:

Ambiguous grammar support. Parser generates all possible derivation trees.
Support for tokens that have multiple types attached. For example, the word cook in English may be either a verb or a noun.

I don’t know whether the implementation works correctly.

2 Parser language🔗ℹ

The language for amb-parser is similar to one in brag, but with several differences:

Semicolon is required at the end of a production.
Terminals are written in lowercase, non-terminals are in uppercase.

Note: the parser does not support left-recursion. (Yet Another Parser that Doesn’t Support Left Recursion, yapdslr).

Example:

#lang amb-parser
S : NP VP;
NP : determiner? adjective* noun;
VP : verb NP;

The parser supports ONLY:

Rule combinations:
S : NP VP;
Optional rule:
NP : determiner? noun;
Kleene star rule:
NP : adjective* noun;
Alteration on the top level of an expansion:
A : b | c;

Note: the parser does not support grouping.

Note: you are allowed to combine question and star operators only in the order: rule*?.

2.1 Formal parser language definition🔗ℹ

#lang brag

parser-syntax : rule+

rule : NON-TERMINAL COLON alteration SEMICOLON

alteration : expansion (ALTERATION expansion)*
expansion : (non-terminal | terminal)*

non-terminal : NON-TERMINAL [STAR] [QUESTION]
terminal : TERMINAL [STAR] [QUESTION]

3 Parser interface🔗ℹ

Firstly, import amb-parser in your Racket module.
Then create a file with
#lang amb-parser
line and with your grammar and require it in your Racket module.

You will be provided with the parse function. You can use this function to parse your text. But before, your text should be separated into tokens.

For more details, refer to the parse documentation.

4 Usage example🔗ℹ

Let’s take the grammar in Parser interface:

#lang amb-parser
S : NP VP;
NP : determiner? adjective* noun;
VP : verb NP;

For sentence "The big cat catched a small grey mouse.", which is transformed in the list of tokens:

(list
  (token "the" '(determiner))
  (token "big" '(adjective))
  (token "cat" '(noun))
  (token "catched" '(verb))
  (token "a" '(determiner))
  (token "small" '(adjective))
  (token "grey" '(adjective))
  (token "mouse" '(noun)))

The parser will generate such a result:

(list
  (parser-result
    '(S
       (NP (determiner "the")
           (adjective* (adjective "big") (adjective*))
           (noun "cat"))
       (VP
         (verb "catched")
         (NP
           (determiner "a")
           (adjective* (adjective "small") (adjective* (adjective "grey") (adjective*)))
           (noun "mouse"))))
    '()))

Note how the star rule works.

5 Parse tree structure🔗ℹ

In this section we will describe the structure of the parse tree without specifying that the parser returns multiple results and every parser-result contains the rest of tokens.

Every rule name : expansion; will be parsed as (name expansion).
Every expansion rule1 rule2 will be parsed as an association list, where keys and values are organized as list with two elements, keys are rule names and values are the result of parsing the rules.
Every terminal terminal will be parsed as (list ’terminal token-str) where token-str is the string of the matched token.

If some parsing is failed then the parse returns empty (meaning there is no derivation trees and no parser results).

6 Types and functions🔗ℹ

6.1 Tokens🔗ℹ

struct
(struct token (str pos)
    #:transparent)
  str : string?
  pos : (listof symbol?)

Represents a token.

pos is a list of all possible parts of speech for this token.

When parser tries to parse a terminal rule it checks whether the terminal is a member of pos list.

Note: in fact, the implementation does not really utilizes the info that the str is a string?.

procedure
(token? x) → boolean?
x : any/c

Returns whether the x is a token.

procedure
(token-str token) → string?
token : token?

Returns the string of the token.

procedure
(token-pos token) → (listof symbol?)
token : token?

Returns all possible parts of speech for the token.

6.2 Parser result🔗ℹ

struct
(struct parser-result (data rest)
    #:transparent)
  data : any/c
  rest : (listof token?)

Represents a single result of parsing. It consists of parsed data and the rest of tokens (uparsed part).

procedure
(parser-result? x) → boolean?
x : any/c

Returns whether the x is a parser-result.

procedure
(parser-result-data res) → any/c
res : parser-result?

Returns the parsed data of the res.

procedure
(parser-result-rest res) → (listof token?)
res : parser-result?

Returns the uparsed tokens of the res.

6.3 Parsing🔗ℹ

procedure
(parse non-terminal tokens) → (listof parser-result?)
non-terminal : symbol?
tokens : (listof token?)

Parses non-terminal from the tokens list.

Parser returns a list of all possible derivation trees. The results are organized into parser-result structures.

WARNING: There is a bad design desicion. The parse function requires a list of tokens, instead of a lambda function that generates tokens on demand.

1	Introduction
2	Parser language
3	Parser interface
4	Usage example
5	Parse tree structure
6	Types and functions