1 Tokens


package: yaragg

A token is a single unit of text. Tokenizing is the act of converting a sequence of characters into a sequence of tokens, where each character becomes part of exactly one token. Tokens are the inputs to parsers: a parser’s job is to convert a sequence of tokens into a structured parse tree. There are four types of tokens in Yaragg: atoms, punctuation, whitespace, and comments.

procedure
(token? v) → boolean?
v : any/c

1.1 Atom Tokens

An atom is a token that represents a single independent piece of text that’s relevant to parsing, such as a symbolic name or a literal number. When tokens are parsed, atoms become the leaves of the parse tree. Every atom has a type, a symbol that parsers use to distinguish different kinds of atoms, and a value, which doesn’t affect parsing but does end up embedded in the final parse tree.

procedure
(atom? v) → boolean?
v : any/c

procedure
(atom type value [#:properties properties]) → atom?
type : symbol?
value : any/c
properties : hash? = (hash)

procedure
(atom-type atom) → symbol?
atom : atom?

procedure
(atom-value atom) → any/c
atom : atom?

procedure
(atom-properties atom) → (and/c hash? immutable?)
atom : atom?
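
The following is a minimal sketch of how the atom procedures fit together. To keep it runnable without the package installed, it defines a local stand-in that mirrors the documented signatures; the stand-in is an assumption for illustration, not Yaragg’s own source. The type symbols and the 'radix property key are likewise hypothetical.

```racket
#lang racket/base

;; Stand-in for the documented atom API (an assumption, not Yaragg's source).
;; #:omit-define-syntaxes leaves the name `atom` free for the keyword
;; constructor defined below.
(struct atom (type value properties)
  #:transparent
  #:omit-define-syntaxes
  #:constructor-name make-atom)

;; Mirrors the documented signature:
;; (atom type value [#:properties properties])
(define (atom type value #:properties [properties (hash)])
  (make-atom type value properties))

;; Two atoms of different types, as a lexer might produce for "x" and "42".
(define x-atom (atom 'identifier "x"))
(define num-atom (atom 'number 42 #:properties (hash 'radix 10)))

(atom-type x-atom)        ; the symbol a parser would match against
(atom-value num-atom)     ; the value that ends up in the parse tree
(atom-properties x-atom)  ; the default, an empty hash
```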

1.2 Punctuation Tokens

A punctuation token is a token that represents a piece of text whose purpose is to separate or delimit other pieces of text, such as parentheses, brackets, and semicolons in various programming languages. Parsers use punctuation when deciding how to structure atoms into parse trees.

procedure
(punctuation? v) → boolean?
v : any/c

procedure
(punctuation string) → punctuation?
string : string?

procedure
(punctuation-string punctuation) → (and/c string? immutable?)
punctuation : punctuation?
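
As a rough illustration of the procedures above, the sketch below builds a few punctuation tokens. The struct is a local stand-in (an assumption, so the example runs without the package); its guard stores an immutable copy of the string so that the accessor agrees with the documented result contract.

```racket
#lang racket/base

;; Stand-in for the documented punctuation API (an assumption, not Yaragg's
;; source). The guard converts the argument to an immutable string.
(struct punctuation (string)
  #:transparent
  #:guard (lambda (s type-name) (string->immutable-string s)))

(define open-paren (punctuation "("))
(define close-paren (punctuation ")"))
;; (string #\;) builds a mutable string; the token still stores an immutable one.
(define semicolon (punctuation (string #\;)))

(punctuation-string open-paren)
(immutable? (punctuation-string semicolon))
```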

1.3 Whitespace Tokens

A whitespace token is a token made up of invisible whitespace characters. Whitespace tokens only exist to separate other tokens and do not affect parsing. Adding or removing whitespace tokens from a token stream does not change the behavior of any Yaragg parser. However, other tools such as formatters and pretty printers may be affected by whitespace tokens.

A whitespace token always represents some number of linebreaks followed by some number of spaces. Because spaces can only come after the linebreaks, spaces that precede a linebreak cannot be represented, so this representation disallows trailing whitespace. That is usually desirable when creating and parsing programming languages.

procedure
(whitespace? v) → boolean?
v : any/c

procedure
(whitespace [#:linebreak-count linebreak-count #:space-count space-count]) → whitespace?
linebreak-count : exact-nonnegative-integer? = 0
space-count : exact-nonnegative-integer? = 0

procedure
(whitespace-linebreak-count space) → exact-nonnegative-integer?
space : whitespace?

procedure
(whitespace-space-count space) → exact-nonnegative-integer?
space : whitespace?
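
The linebreaks-then-spaces representation can be sketched as follows. The struct and keyword constructor are local stand-ins that mirror the documented signatures, and whitespace->string is a hypothetical helper that is not part of the documented API; it only shows what text a given whitespace token would stand for.

```racket
#lang racket/base

;; Stand-in for the documented whitespace API (an assumption, not Yaragg's
;; source).
(struct whitespace (linebreak-count space-count)
  #:transparent
  #:omit-define-syntaxes
  #:constructor-name make-whitespace)

;; Mirrors the documented signature; both counts default to 0.
(define (whitespace #:linebreak-count [linebreak-count 0]
                    #:space-count [space-count 0])
  (make-whitespace linebreak-count space-count))

;; Hypothetical helper: render the token as text, linebreaks first, then
;; spaces. There is no way to place spaces before a linebreak, which is why
;; trailing whitespace cannot be expressed.
(define (whitespace->string space)
  (string-append
   (make-string (whitespace-linebreak-count space) #\newline)
   (make-string (whitespace-space-count space) #\space)))

;; One linebreak followed by a two-space indent.
(define indent (whitespace #:linebreak-count 1 #:space-count 2))
(whitespace->string indent)
```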

1.4 Comment Tokens

A comment token is a token made up of arbitrary text that is only relevant to human readers. Comments are ignored by parsers. Formatters may move comments around, but they won’t remove them.

procedure
(comment? v) → boolean?
v : any/c

procedure
(comment text) → comment?
text : string?

procedure
(comment-text comment) → (and/c string? immutable?)
comment : comment?
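
To illustrate what it means for comments to be ignored by parsers, the sketch below removes comment tokens from a small token list, leaving only what a parser would act on. Both structs are local stand-ins (assumptions, so the example runs without the package), and strip-comments is a hypothetical helper rather than a documented Yaragg procedure.

```racket
#lang racket/base

;; Stand-ins for the documented comment and punctuation APIs (assumptions,
;; not Yaragg's source).
(struct comment (text) #:transparent)
(struct punctuation (string) #:transparent)

;; Hypothetical helper: keep every token that is not a comment.
(define (strip-comments tokens)
  (filter (lambda (t) (not (comment? t))) tokens))

(define tokens
  (list (punctuation "(")
        (comment "TODO: handle empty input")
        (punctuation ")")))

(comment-text (cadr tokens))
(map punctuation-string (strip-comments tokens))
```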

1.5 Lexemes

A lexeme is a combination of a token and information about the original source location of that token. Parsers normally produce parse trees from token streams, but a lexeme stream is needed when parsing input into syntax objects with source locations.

procedure
(lexeme? v) → boolean?
v : any/c

procedure
(lexeme token location) → lexeme?
token : token?
location : srcloc?

procedure
(lexeme-token lexeme) → token?
lexeme : lexeme?

procedure
(lexeme-location lexeme) → srcloc?
lexeme : lexeme?
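
A minimal sketch of pairing tokens with source locations follows. The lexeme and punctuation structs are local stand-ins (assumptions, so the example runs without the package); srcloc is the standard structure from racket/base. The file name and position values are made up for illustration.

```racket
#lang racket/base

;; Stand-ins for the documented punctuation and lexeme APIs (assumptions,
;; not Yaragg's source).
(struct punctuation (string) #:transparent)
(struct lexeme (token location) #:transparent)

;; srcloc fields are: source, line, column, position, span.
(define lexemes
  (list (lexeme (punctuation "(") (srcloc "example.txt" 1 0 1 1))
        (lexeme (punctuation ")") (srcloc "example.txt" 1 1 2 1))))

;; Dropping the locations recovers a plain token list.
(define tokens (map lexeme-token lexemes))

(map punctuation-string tokens)
(srcloc-column (lexeme-location (cadr lexemes)))
```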
