Unicode Break Algorithms

 (require unicode-breaks) package: unicode-breaks

Racket 8.7 added basic support for working with Unicode grapheme clusters, where multiple codepoints make up an entity that is rendered as a single character. This module expands that functionality and adds the word and sentence break algorithms from Unicode Annex #29, Text Segmentation. It does not attempt to provide language- or locale-specific algorithms.

The rules used are in accordance with Unicode 15.0, to match Racket 8.9.

1 Grapheme Breaks

procedure

(in-graphemes str [start end]) → (sequence/c string?)

  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
Returns a sequence of strings, one per grapheme in the specified range of str. The result is undefined if start is not the first index of a grapheme cluster.
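As a rough illustration of how iterating by grapheme differs from iterating by character, the sketch below walks a string that contains a combining accent. The sample string and the name grapheme-lengths are assumptions made for this example; only in-graphemes comes from this module.

```racket
#lang racket/base
(require unicode-breaks)

;; "e" followed by U+0301 COMBINING ACUTE ACCENT: two codepoints that
;; UAX #29 treats as a single grapheme cluster.
(define sample "cafe\u0301!")

;; Number of codepoints in each grapheme, in order of appearance.
(define grapheme-lengths
  (for/list ([g (in-graphemes sample)])
    (string-length g)))

(printf "~a codepoints, ~a graphemes\n"
        (string-length sample)
        (length grapheme-lengths))
```

The accented letter should show up as one entry of length 2, so the grapheme count is one less than the codepoint count.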

procedure

(string-split-graphemes str [start end]) → (listof string?)

  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
Returns a list of the graphemes in the specified range of str. The result is undefined if start is not the first index of a grapheme cluster.

procedure

(string-split-graphemes/immutable str [start end])
  → (listof (and/c string? immutable?))
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
Same as string-split-graphemes, but returns immutable strings.

procedure

(string-grapheme-indexes str [start end])
  → (listof exact-nonnegative-integer?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
Returns a list of the starting indexes of each grapheme in the specified range of str. The result is undefined if start is not the first index of a grapheme cluster.
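The relationship between the index list and the split functions can be sketched by rebuilding the graphemes from their starting indexes. The names starts, ends, and pieces are hypothetical, and the sketch assumes a non-empty string so that starts has at least one element.

```racket
#lang racket/base
(require unicode-breaks)

(define sample "cafe\u0301!")

;; Starting index of every grapheme in the string.
(define starts (string-grapheme-indexes sample))

;; Each grapheme ends where the next one starts; the last one ends
;; at the end of the string.
(define ends (append (cdr starts) (list (string-length sample))))

(define pieces
  (for/list ([a (in-list starts)]
             [b (in-list ends)])
    (substring sample a b)))

;; Compare against the list produced directly by the module.
(equal? pieces (string-split-graphemes sample))
```

Working with indexes avoids allocating a substring per grapheme when only positions are needed, for example when moving a cursor.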

2 Word Breaks

procedure

(char-word-break-property ch) → symbol?

  ch : char?
Returns the Unicode word break property of the given character, which is one of the following symbols: 'ALetter, 'CR, 'Double_Quote, 'Extend, 'ExtendNumLet, 'Format, 'Hebrew_Letter, 'Katakana, 'LF, 'MidLetter, 'MidNum, 'MidNumLet, 'Newline, 'Numeric, 'Other, 'Regional_Indicator, 'Single_Quote, 'WSegSpace or 'ZWJ.
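To get a feel for the property values, a small sketch can print the property for a handful of ASCII characters. The choice of sample characters is an assumption for illustration.

```racket
#lang racket/base
(require unicode-breaks)

;; A letter, a digit, and several punctuation characters that the
;; word break rules treat differently from one another.
(define samples (list #\a #\7 #\_ #\' #\. #\, #\: #\space))

(for ([ch (in-list samples)])
  (printf "~s -> ~s\n" ch (char-word-break-property ch)))
```

Characters such as the apostrophe and full stop carry "Mid" style properties, which is why they can sit inside a word or number without causing a break.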

procedure

(string-word-break-at? str i [start end]) → boolean?

  str : string?
  i : exact-nonnegative-integer?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
Returns #t if there is a word break immediately before the character at index i. There is always a break at start and at end.

procedure

(string-word-span str start [end]) → exact-nonnegative-integer?

  str : string?
  start : exact-nonnegative-integer?
  end : exact-nonnegative-integer? = (string-length str)
Returns the number of characters (codepoints) between start and the next Unicode word break, without going past end.
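The two low-level functions above can be combined into a simple word walker. The helper words-via-span below is hypothetical and not part of this module; it is a minimal sketch that assumes string-word-span returns a positive count whenever start is before the end of the string.

```racket
#lang racket/base
(require unicode-breaks)

;; Hypothetical helper: collect the pieces between successive word
;; breaks by repeatedly asking for the span starting at index i.
(define (words-via-span str)
  (define len (string-length str))
  (let loop ([i 0] [acc '()])
    (if (>= i len)
        (reverse acc)
        ;; max guards against a zero-length span, which would
        ;; otherwise make the loop spin forever.
        (let ([n (max 1 (string-word-span str i))])
          (loop (+ i n) (cons (substring str i (+ i n)) acc))))))

(define text "can't stop")

(words-via-span text)
(string-word-break-at? text 3) ; index 3 is the apostrophe
(string-word-break-at? text 5) ; index 5 is the space
```

Under the default rules, an apostrophe between two letters should not introduce a break, while the position before the space should.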

procedure

(in-words str [start end #:skip-blanks? skip-blanks?])
  → (sequence/c string?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
  skip-blanks? : any/c = #f
Returns a sequence that produces a series of strings, one word of the specified range of str per entry. If #:skip-blanks? is true, "words" that consist only of white space are omitted.

procedure

(string-split-words str [start end #:skip-blanks? skip-blanks?])
  → (listof string?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
  skip-blanks? : any/c = #f
Returns a list of the words in the specified range of str. If #:skip-blanks? is true, "words" that consist only of white space are omitted.
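The effect of #:skip-blanks? is easiest to see side by side. In the sketch below, the sample text and the names all-pieces and no-blanks are assumptions for illustration.

```racket
#lang racket/base
(require unicode-breaks)

(define text "Stop,  look (and listen).")

;; Every segment between word breaks, including runs of white space.
(define all-pieces (string-split-words text))

;; The same segments with white-space-only entries dropped.
(define no-blanks (string-split-words text #:skip-blanks? #t))

(printf "all:       ~s\n" all-pieces)
(printf "no blanks: ~s\n" no-blanks)
```

Punctuation is not white space, so entries such as "," and "(" can be expected to remain in both lists; filtering those out is left to the caller.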

procedure

(string-split-words/immutable str [start end #:skip-blanks? skip-blanks?])
  → (listof (and/c string? immutable?))
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
  skip-blanks? : any/c = #f
Same as string-split-words, but returns immutable strings.

procedure

(string-word-break-indexes str [start end])
  → (listof exact-nonnegative-integer?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
Returns a list of the indexes of each word break in the specified range of str. The implicit breaks at the beginning and end of the string are included.
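Because the list includes the breaks at both ends, adjacent pairs of indexes delimit the segments. The following sketch rebuilds the segments that way; breaks and pieces are hypothetical names, and a non-empty string is assumed.

```racket
#lang racket/base
(require unicode-breaks)

(define text "Hi there")

;; All word break positions, including 0 and (string-length text).
(define breaks (string-word-break-indexes text))

;; Pair each break with the one that follows it. The parallel
;; iteration stops when the shorter list runs out.
(define pieces
  (for/list ([a (in-list breaks)]
             [b (in-list (cdr breaks))])
    (substring text a b)))

(printf "~s -> ~s\n" breaks pieces)
```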

3 Sentence Breaks

procedure

(char-sentence-break-property ch) → symbol?

  ch : char?
Returns the Unicode sentence break property of the given character, which is one of the following symbols: 'ATerm, 'CR, 'Close, 'Extend, 'Format, 'LF, 'Lower, 'Numeric, 'OLetter, 'Other, 'SContinue, 'STerm, 'Sep, 'Sp or 'Upper.
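As with the word break property, a short sketch can print the sentence break property for a few characters. The sample characters are chosen for illustration only.

```racket
#lang racket/base
(require unicode-breaks)

;; Letters of both cases, a digit, sentence terminators, a comma,
;; a closing parenthesis, and a space.
(define samples (list #\A #\a #\5 #\. #\! #\? #\, #\) #\space))

(for ([ch (in-list samples)])
  (printf "~s -> ~s\n" ch (char-sentence-break-property ch)))
```

Note that the full stop and the exclamation mark fall into different classes ('ATerm and 'STerm), since a full stop is ambiguous between ending a sentence and marking an abbreviation or decimal point.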

procedure

(in-sentences str [start end]) → (sequence/c string?)

  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
Returns a sequence of strings, one per sentence in the specified range of str. The result is undefined if start is not the first index of a sentence.

procedure

(string-split-sentences str [start end]) → (listof string?)

  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
Returns a list of the sentences in the specified range of str. The result is undefined if start is not the first index of a sentence.
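The sketch below splits a short text into sentences. The sample text is an assumption; it deliberately includes an abbreviation to show what the lack of language-specific tailoring means in practice.

```racket
#lang racket/base
(require unicode-breaks)

(define text "Mr. Smith left. Did he? Yes!")

(define sentences (string-split-sentences text))

(for ([s (in-list sentences)])
  (printf "~s\n" s))
```

Under the default UAX #29 rules, trailing spaces stay attached to the sentence they follow, and a full stop followed by a space and an uppercase letter counts as a sentence end. The abbreviation "Mr." should therefore come out as its own piece, and callers who need smarter handling would have to post-process the result.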

procedure

(string-split-sentences/immutable str [start end])
  → (listof (and/c string? immutable?))
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
Same as string-split-sentences, but returns immutable strings.

procedure

(string-sentence-indexes str [start end])
  → (listof exact-nonnegative-integer?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
Returns a list of the starting indexes of each sentence in the specified range of str. The result is undefined if start is not the first index of a sentence.
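A minimal sketch of the index form, using a hypothetical sample text, prints where each sentence begins without copying the sentences themselves.

```racket
#lang racket/base
(require unicode-breaks)

(define text "One. Two. Three.")

;; Starting index of each sentence; the end of the string is not
;; included, unlike the word break index list.
(define starts (string-sentence-indexes text))

(for ([i (in-list starts)])
  (printf "sentence starts at index ~a with ~s\n" i (string-ref text i)))
```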

4 Other functions

procedure

(char-east-asian-width-property ch) → (or/c 'N 'Na 'H 'A 'F 'W)

  ch : char?
Returns the Annex #11 East Asian Width property assigned to the given character.
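One common use of this property is estimating how many terminal columns a string occupies. The helper estimated-columns below is hypothetical and deliberately crude: it assumes wide ('W) and fullwidth ('F) characters take two columns and everything else takes one, and it ignores combining marks and ambiguous-width characters.

```racket
#lang racket/base
(require unicode-breaks)

;; Hypothetical helper: rough column count for monospaced display.
(define (estimated-columns str)
  (for/sum ([ch (in-string str)])
    (case (char-east-asian-width-property ch)
      [(W F) 2]   ; wide and fullwidth characters
      [else 1]))) ; narrow, halfwidth, neutral, ambiguous

(estimated-columns "abc")
;; U+65E5 U+672C U+8A9E, three CJK ideographs
(estimated-columns "\u65E5\u672C\u8A9E")
```

A more careful version might iterate with in-graphemes and measure each cluster once, rather than summing over individual codepoints.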