2 Framework for Implementing Additional Tools


2.1 Managing Python Processes from Racket

package: pydrnlp

procedure
(python-worker? v) → boolean?
  v : any/c
syntax
(define-python-worker id
  mod-bytes-literal arg-bytes-literal ...)

   id? : (-> any/c boolean?)
   id-revision : jsexpr?
   launch-id : (->* [] [#:quiet? any/c] id?)
   id-send/raw : (->* [id? jsexpr?]
                      [#:who symbol?]
                      (stream/c jsexpr?))
procedure
(python-worker-running? worker) → boolean?
  worker : python-worker?
procedure
(python-worker-kill worker) → any
  worker : python-worker?
procedure
(python-worker-dead-evt worker) → (evt/c (or/c #f exn:fail?))
  worker : python-worker?

A Python worker value encapsulates a Python process running one of pydrnlp’s Python modules and manages request–response communication between Racket and Python. All Python workers are recognized by the predicate python-worker? and share the same abstract interface.

Every Python worker is an instance of some concrete Python worker type, which specifies a particular Python module with which to communicate. A Python worker type is defined by the define-python-worker form, which binds id?, id-revision, launch-id, and id-send/raw (synthesized with the lexical context of id) to values implementing the type-specific portion of the interface. The mod-bytes-literal must name one of pydrnlp’s Python modules using the same syntax as python -m: for example, #"pydrnlp.trends". Any additional arg-bytes-literals are passed as command-line arguments to the Python module. The Python module named by mod-bytes-literal must define a Python revision function (see pydrnlp/support/python-lang and python-revision-value/c): if it does not, the define-python-worker form will raise a syntax error.

The launch-id function creates Python worker values of the concrete Python worker type, which are recognized by the predicate id?. If the #:quiet? argument is given and is not #false, the Python process’s standard error output is written to the current-error-port; otherwise, it is discarded. Creating a Python worker allocates system-level resources: in particular, it starts an OS subprocess. These resources are placed in the custody of the current custodian, and they must be freed when the worker is no longer needed, typically by calling either python-worker-kill or custodian-shutdown-all. More generally, the system-level resources are freed when the worker becomes dead.

Calling id-send/raw sends a jsexpr as a request to the process encapsulated by the Python worker and returns its response as a lazy stream of jsexpr values. To implement the Python side of this interaction, use pydrnlp.jsonio. Messages are sent to the Python process asynchronously, in sequential order, and id-send/raw returns immediately, though forcing the returned stream (e.g. with stream-first or stream-rest) will block until the request has been sent and the response has begun to be received. Python worker values are thread-safe: id-send/raw can be called with the same worker concurrently from multiple Racket threads, and one client thread being terminated, blocking, etc. will not interfere with use of the worker from other client threads. However, workers intentionally are not fully “kill-safe” in the sense of [Flatt04]: shutting down the managing custodian must release the system-level resources, and clients can cause the worker to become dead (intentionally or not) in various other ways.

Process-level parallelism can be obtained by using launch-id to create multiple workers of the same Python worker type. Note, however, that the digitalricoeur.org server currently has few cores, which limits the benefit there.

The functions implemented by Python worker types are often expensive. The id-revision value is defined to support caching and avoid redundant calls to id-send/raw. When id-revision is #false, any cached value should be ignored. Otherwise, if id-revision is equal? to a cached value of id-revision from a previous run, the Python module encapsulated by the Python worker type promises that calling id-send/raw with “the same” request would produce “the same” response, and therefore cached responses can be used. Of course, the Python module must take care to live up to this promise when implementing its Python revision function. Note that the applicable notion of “the same” is specific to the Python module and Python worker type: “the same” may mean something either stronger or weaker than equal?. In addition to the Python revision function, the value of id-revision also reflects the versions of the spaCy library and the language models being used.
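The caching rule above is language-neutral, so it can be sketched outside Racket. The following is only an illustration in Python, not pydrnlp code: can_reuse_cache is a hypothetical helper, None stands in for #false, and == stands in for equal?.

```python
def can_reuse_cache(current_revision, cached_revision):
    """Decide whether responses cached under cached_revision may be reused.

    Hypothetical helper for illustration. None plays the role of Racket's
    #false, which is an assumption of this sketch.
    """
    if current_revision is None:
        # No promise is made about responses, so any cache must be ignored.
        return False
    # Revisions are integers or (nested) lists of integers, so structural
    # equality is enough to compare them.
    return current_revision == cached_revision


reuse_same = can_reuse_cache([0, [1, 2]], [0, [1, 2]])
reuse_changed = can_reuse_cache([0, [1, 3]], [0, [1, 2]])
reuse_unknown = can_reuse_cache(None, None)
```

Note that a revision of None never validates a cache, even against a cached None.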

Note that access to id-revision does not require running Python (or even having it installed), even though id-revision incorporates values defined in Python code. Instead, by enforcing constraints on the syntax of Python revision functions, pydrnlp/support/python-lang is able to analyse their definitions statically and compile them to Racket code.

The name of id-send/raw reflects the fact that it enforces only the raw, jsexpr-based communication protocol common to all Python workers. In practice, the Racket and Python parties to a particular interaction will both have invariants about the messages they expect to send and receive. When designing a new Python worker type, id-send/raw should be used to implement higher-level communication functions, which can enforce specific contracts and convert values to and from the jsexpr representation used for communication. To facilitate such wrapper functions, id-send/raw will report errors that cannot be detected by first-order tests using its #:who argument, if given, rather than its own symbolic name. On the other hand, id-send/raw does enforce its documented contract and will blame its callers for violating their obligations.

For example, a test like:

(when (python-worker-running? a-worker)
  (id-send/raw a-worker "Hi, world!"))

is not sufficient to ensure that a-worker is running when id-send/raw is called, because a-worker could have become dead concurrently, after python-worker-running? returns but before id-send/raw is called.

If the Python worker given to id-send/raw becomes dead before id-send/raw can enqueue the jsexpr value to be sent asynchronously, id-send/raw raises an exception, which will refer to its #:who argument, if given. If the Python worker becomes dead before it has finished producing its response, an exception is raised when the corresponding part of the stream returned by id-send/raw is forced.

A Python worker is dead when it has freed all of its system-level resources, and thus is no longer in communication with a Python process. Programmers must ensure that the worker is dead when they no longer need it, and generally they will need to do so explicitly. The function python-worker-kill causes its argument to become dead immediately; calling it on a Python worker that is already dead has no effect. Using python-worker-kill is equivalent to shutting down the worker’s managing custodian, except that python-worker-kill only affects resources encapsulated by the given Python worker value.

It is almost always best to make a Python worker dead with python-worker-kill or custodian-shutdown-all as soon as you can determine that the worker value is no longer needed. However, calls to launch-id incur significant overhead, so it is much better to reuse Python workers than to create and free them repeatedly. Even better, by consulting id-revision, you may be able to avoid calling launch-id in the first place.

Even if it is never subjected to python-worker-kill or custodian-shutdown-all, a Python worker may still become dead for other reasons. In particular, a Python worker will become dead if the Python process it manages exits of its own accord, either successfully or, for example, due to an unhandled exception. Nonetheless, this possibility does not relieve programmers of the burden of ensuring that all Python workers do, in fact, become dead. Even if a Python worker type implements a communication protocol in which the Python module is expected to exit, the Racket side of the communication should still check that the worker actually is dead: if it isn’t, the Racket side should clean up and signal that the invariants of the communication protocol have been violated.

The function python-worker-dead-evt takes any Python worker value and produces a synchronizable event that becomes ready for synchronization when the worker is dead. The event’s synchronization result is either #false or an exn:fail that caused the worker to become dead. Currently, a #false result does not necessarily mean that the worker became dead “normally”: this may be improved in the future.

Conversely, the predicate python-worker-running? recognizes any Python worker that is not currently dead. Taking a point-in-time snapshot has some limitations, but python-worker-running? has the benefit of being an inexpensive first-order test.

value
python-revision-value/c : flat-contract?
 = (or/c #f
         exact-integer?
         (listof python-revision-value/c))

Documentation forthcoming.

2.2 Python–Racket Bridge Language

#lang pydrnlp/support/python-lang

package: pydrnlp

Documentation forthcoming.

If you wish to begin your module with both a #! line and a coding declaration, the following lines satisfy the Python, Racket, and Emacs parsers simultaneously:

#!/usr/bin/env python3
#lang pydrnlp/support/python-lang # -*- coding: utf-8 -*-
"""Module docstring

Lots of great documentation here ...
"""

In Python 3, the default encoding is UTF-8.

Python revision function

2.3 Scribbling Python Documentation

Documentation forthcoming.

2.4 Python Utility Modules

2.4.1 pydrnlp.language

import pydrnlp.language

package: pydrnlp

Uniform, lazy loading of spaCy language models.

The primary entry point for spaCy functionality is a spacy.language.Language object, which must be loaded through a language-specific model package. Loading a language is expensive and should not be repeated. Additionally, pydrnlp should be able to change which model it uses for each supported language without necessitating changes in every Python module that needs to use spacy.language.Language objects.

This functionality is provided by pydrnlp.language.get(), which is similar in spirit to spacy.load().

TODO: en_core_web_lg sounds like it has pre-trained models and may give better accuracy. OTOH it is an order of magnitude bigger (667 vs 36 MB).

procedure
(revision) → python-revision-value/c
= (let* () 0)

Python method
def get(lang_str)

Returns a spacy.language.Language instance for the given IANA string.

Models from spaCy are loaded lazily.

2.4.2 pydrnlp.jsonio

import pydrnlp.jsonio

package: pydrnlp

Provides classes for programs that loop over JSON IO.

This module imposes the invariant that JSON values must be delimited by newlines (i.e. "\n") and that the JSON values in input may not use the newline character internally, even where insignificant whitespace is allowed by the JSON spec. Using newlines as a delimiter avoids a limitation of Python JSON parsers, which block until they encounter an EOF. This module also writes JSON with a terminating newline, though Racket’s JSON parser doesn’t need this.
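As a rough illustration of the newline-delimited framing described above, the following sketch uses only the Python standard library. It is not the pydrnlp.jsonio implementation: run_json_lines and its parameters are hypothetical names chosen for this example, and in-memory streams stand in for the process’s standard input and output.

```python
import io
import json


def run_json_lines(in_stream, out_stream, on_input):
    """Read one JSON value per line, apply on_input, write one reply per line.

    Hypothetical helper for illustration; not part of pydrnlp.jsonio.
    """
    for line in in_stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between messages
        request = json.loads(line)  # the whole value sits on this one line
        response = on_input(request)
        # json.dumps emits no literal newline characters by default,
        # so "\n" unambiguously terminates each message.
        out_stream.write(json.dumps(response) + "\n")
        out_stream.flush()  # make the reply visible to the peer right away


# In-memory streams stand in for stdin and stdout in this sketch.
requests = io.StringIO('{"text": "hi"}\n[1, 2, 3]\n')
replies = io.StringIO()
run_json_lines(requests, replies, lambda value: {"echo": value})
```

Because each request is parsed as soon as its line arrives, the loop can answer one message before the next is sent, rather than waiting for end of input.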

Python method
def start_loop(on_input, *, [description])
