2 Getting Started

This section explains the process of converting a scanned PDF document to an initial TEI XML file that satisfies our project’s requirements.

The first step in this process is to extract the plain text from the PDF document using Optical Character Recognition (OCR) software. The plain text should be saved in a file with the extension .txt. Our current strategy is to take care at this initial stage to produce the highest-quality plain text version possible: for example, we will attempt to remove purely decorative page headers and footers at this stage. You will receive additional guidance if you are participating in this step of the process. The most important requirement for this step is that we must use the OCR software to mark each page-break in the generated plain text file with an ASCII “form feed” character.

The “form feed” character is often notated as "\f". In Racket, it is #\page, which is the value of (integer->char 12).
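As a quick sanity check on an OCR output file, the page-break markers can be counted directly. The following is a minimal Python sketch, not part of the project's tooling. The function name split_pages and the sample string are hypothetical, and it assumes that a form feed separates consecutive pages rather than terminating each one.

```python
FORM_FEED = "\f"  # ASCII 12, the same character as Racket's #\page


def split_pages(text: str) -> list[str]:
    """Split OCR output into pages at each form feed character."""
    return text.split(FORM_FEED)


# Hypothetical sample standing in for the contents of a .txt file.
sample = "first page\fsecond page\fthird page"
pages = split_pages(sample)
print(len(pages))  # prints 3
```

If the page count reported this way does not match the number of pages in the scanned PDF, the OCR step probably did not mark every page-break.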

After producing a plain text file, the next step is to transform it into an initial TEI XML document. Our GUI tool “TEI Lint” (see TEI Lint, below) performs most of this process automatically: you need only fill in a form with some basic information about the document. A few of the details are worth addressing specifically:

You will need to provide a title for the document, which will be used to fill in the TEI title element. The title needs to be sufficient to unambiguously refer to the specific instance in question. In some cases, this means that it will be necessary to add disambiguating information: for example, use “Introduction [in Philosophical Foundations of Human Rights]” rather than simply “Introduction.”
You will also be asked to provide a human-readable citation specifying the source from which the digitized document was created (for example, as drawn from the “Books in English” spreadsheet in our Google Drive folder). This will become part of the content of the bibl element.
Each document will automatically include an author element representing Paul Ricœur. You must also enter information on any additional authors, editors, translators, etc. using the “TEI Lint” interface. This information will be used to generate additional author and editor elements.
A particularly important part of the initial encoding process is adding page numbers. The “TEI Lint” interface requires you to account for every page in the document. Pages may be assigned Roman or Arabic numerals or be marked as unnumbered in the source, and consecutive pages may be numbered in one step. For example, a book might begin with 5 unnumbered pages, then 12 pages with Roman numerals beginning with “i,” and finally 403 pages with Arabic numerals beginning with “1.” An article, on the other hand, might consist of 18 pages with Arabic numerals beginning with “386.” The numbering information will be used to generate pb elements.
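To make the page-numbering scheme concrete, the two examples above can be expanded into one label per page. This is only an illustrative Python sketch, not how “TEI Lint” is implemented: the functions to_roman and page_labels, the (count, style, start) tuple format, and the use of lowercase Roman numerals are all assumptions made for the example.

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase Roman numeral."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
             (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def page_labels(runs):
    """Expand (count, style, start) runs into one label per page.

    style is "none" (unnumbered), "roman", or "arabic";
    start is ignored for unnumbered runs, whose label is None.
    """
    labels = []
    for count, style, start in runs:
        for offset in range(count):
            if style == "none":
                labels.append(None)
            elif style == "roman":
                labels.append(to_roman(start + offset))
            elif style == "arabic":
                labels.append(str(start + offset))
            else:
                raise ValueError(f"unknown style: {style}")
    return labels


# The book: 5 unnumbered pages, 12 Roman from "i", 403 Arabic from "1".
book = page_labels([(5, "none", None), (12, "roman", 1), (403, "arabic", 1)])
# The article: 18 pages with Arabic numerals beginning with "386".
article = page_labels([(18, "arabic", 386)])
print(len(book), book[5], book[16], book[17])  # prints 420 i xii 1
print(article[0], article[-1])                 # prints 386 403
```

Describing the pages as runs mirrors the idea of numbering consecutive pages in one step, and the total number of labels should equal the number of pages in the scan.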

Once you have provided all of the required information, “TEI Lint” will allow you to save the document as a TEI XML file (with the extension .xml). The finished files should generally be added to the “TEI” directory of the “texts” repository, which is the basis of the corpus available through the portal website.

2.1 Minimal Template

The initial TEI XML documents generated by “TEI Lint” are structured according to a “minimal template” documented here. You will need to understand the general structure of this template in order to proceed to the additional encoding tasks documented under Refining the Encoding. You should also follow this template if you are preparing a TEI XML document manually, without using “TEI Lint.”

If you are preparing a TEI XML document manually, before adding any XML markup, you must replace the reserved characters < and & in the plain text file with the XML entities &lt; and &amp;, respectively. The command-line tool raco ricoeur/tei encode-xml-entities can do this automatically: see raco ricoeur/tei for details.
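For readers who want to see what this replacement involves, here is a minimal Python sketch of the same idea. It is not the implementation of raco ricoeur/tei encode-xml-entities, and the function name encode_xml_entities is chosen only for illustration.

```python
def encode_xml_entities(text: str) -> str:
    """Replace the reserved characters & and < with XML entities."""
    # "&" must be handled first: otherwise the "&" introduced by "&lt;"
    # would itself be escaped in the second pass.
    return text.replace("&", "&amp;").replace("<", "&lt;")


print(encode_xml_entities("if a < b & b < c"))
# prints: if a &lt; b &amp; b &lt; c
```

The function should be applied exactly once, and only before any markup is added, since running it on text that already contains tags or entities would escape those as well.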

Initially, we enclose all of the text in a single ab (“anonymous block”) element, which is a container for marked-up text (including page-breaks marked with pb elements) that does not specify any semantic meaning. This is a compromise, allowing us to achieve a valid initial TEI encoding without spending the time to manually mark sections and paragraphs.

In the following example, literal syntax should appear verbatim. The placeholder keywords title-statement, source-citation, book/article-target, text-lang, and main-text indicate sections that should be filled in with a specific type of content, which is explained below the example. Whitespace (including indentation) is not significant.

"example.xml"
<?xml version="1.0" encoding="utf-8"?>
<TEI version="5.0" xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        title-statement
      </titleStmt>
      <publicationStmt>
        <authority>Digital Ricoeur</authority>
        <availability status="restricted">
          <p>Not for distribution.</p>
        </availability>
      </publicationStmt>
      <sourceDesc>
        <bibl>
          source-citation
        </bibl>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <textClass>
        <catRef scheme="https://schema.digitalricoeur.org/taxonomy/type"
                target=book/article-target />
        <keywords scheme="https://schema.digitalricoeur.org/tools/tei-guess-paragraphs">
          <term>todo</term>
        </keywords>
      </textClass>
    </profileDesc>
  </teiHeader>
  <text xml:lang=text-lang>
    <body>
      <ab>
        main-text
      </ab>
    </body>
  </text>
</TEI>

title-statement

The title-statement (i.e. the body of the titleStmt element) must contain a single title element, one or more author elements, and any number of editor elements.
There must always be an author element representing Paul Ricœur, which should be exactly as follows:
<author xml:id="ricoeur">Paul Ricoeur</author>

source-citation

The source-citation (i.e. the content of the bibl element) should be free-form text specifying the source from which the digitized document was created—for example, as drawn from the “Books in English” spreadsheet in our Google Drive folder.
The parts of the citation that refer to the publication date must be wrapped in date elements. The source-citation must contain either one or two date elements, depending on whether the source from which the document was prepared is the first published version in any language: see the formal specification of the bibl and date elements for more details and examples.

book/article-target

The book/article-target (i.e. the value of the target attribute of the catRef element) must be either:
"https://schema.digitalricoeur.org/taxonomy/type#article", if the document is an article; or
"https://schema.digitalricoeur.org/taxonomy/type#book", if the document is a book.

text-lang

The text-lang (i.e. the value of the xml:lang attribute of the text element) must be either:
"en" if the document is primarily in English;
"fr" if the document is primarily in French; or
"de" if the document is primarily in German.
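The two enumerated values above lend themselves to a mechanical check. The following Python sketch uses the standard library's xml.etree.ElementTree to test a document's catRef target and xml:lang against the allowed values. The function check_template_values is hypothetical, the sample document is deliberately abbreviated (it omits fileDesc, so it is not a complete instance of the template), and this check is no substitute for validation against the project's formal specification.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
TYPE_BASE = "https://schema.digitalricoeur.org/taxonomy/type"
ALLOWED_TARGETS = {TYPE_BASE + "#article", TYPE_BASE + "#book"}
ALLOWED_LANGS = {"en", "fr", "de"}


def check_template_values(xml_source: str) -> list[str]:
    """Return a list of problems; an empty list means both values are allowed."""
    root = ET.fromstring(xml_source)
    problems = []
    # catRef is nested inside teiHeader, so search all descendants.
    cat_ref = root.find(f".//{TEI}catRef")
    if cat_ref is None or cat_ref.get("target") not in ALLOWED_TARGETS:
        problems.append("catRef target must be the article or book URI")
    # text is a direct child of the TEI root element.
    text = root.find(f"{TEI}text")
    if text is None or text.get(XML_LANG) not in ALLOWED_LANGS:
        problems.append("text xml:lang must be en, fr, or de")
    return problems


# Abbreviated sample for illustration only.
sample = (
    '<TEI version="5.0" xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><profileDesc><textClass>"
    f'<catRef scheme="{TYPE_BASE}" target="{TYPE_BASE}#book"/>'
    "</textClass></profileDesc></teiHeader>"
    '<text xml:lang="en"><body><ab/></body></text>'
    "</TEI>"
)
print(check_template_values(sample))  # prints []
```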

main-text

The main-text part of the template is where the actual digitized text should be included, including pb elements to mark page breaks.
A pb element must be inserted to mark the beginning of every page, including the first page. If the page is assigned a number in the scanned original, the element should include the n attribute to denote the page number, which may be a Roman numeral; if the page is unnumbered, the n attribute is omitted.
When preparing the text, we should be careful to practice non-destructive editing. For example, while we aren’t focusing on adding note tags at this stage, we should leave the numbers in place for footnotes and endnotes so that we can add them later. However, we should remove redundant “decorative” text (like the title of a book printed at the top of every page) that isn’t really part of the work itself, and it is always good to correct OCR errors if we see them.
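When preparing a document manually, the pb elements can be inserted mechanically at the form feed characters left by the OCR step. The Python sketch below illustrates one way to do so under several assumptions: the text has already had its reserved characters replaced with entities, a form feed separates consecutive pages, and labels is a list with one entry per page (None for an unnumbered page) whose values need no attribute escaping. The function name build_main_text is hypothetical.

```python
def build_main_text(escaped_text: str, labels: list) -> str:
    """Prefix every page with a pb element and drop the form feeds."""
    pages = escaped_text.split("\f")
    if len(pages) != len(labels):
        raise ValueError(
            f"{len(pages)} pages in text but {len(labels)} page labels"
        )
    parts = []
    for label, page in zip(labels, pages):
        # Unnumbered pages get a pb element without the n attribute.
        pb = "<pb/>" if label is None else f'<pb n="{label}"/>'
        parts.append(pb + page)
    return "".join(parts)


print(build_main_text("Cover\fPreface\fChapter One", [None, "i", "1"]))
# prints: <pb/>Cover<pb n="i"/>Preface<pb n="1"/>Chapter One
```

Raising an error on a count mismatch reflects the requirement that every page be accounted for, including the first.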
